UNIVERSITETI I EVROPËS JUGLINDORE
УНИВЕРЗИТЕТ НА ЈУГОИСТОЧНА ЕВРОПА
SOUTH EAST EUROPEAN UNIVERSITY

FAKULTETI I SHKENCAVE DHE TEKNOLOGJIVE BASHKËKOHORE
ФАКУЛТЕТ ЗА СОВРЕМЕНИ НАУКИ И ТЕХНОЛОГИИ
FACULTY OF CONTEMPORARY SCIENCES AND TECHNOLOGIES

Third Cycle of Academic Studies – Doctoral Studies

Doctoral Dissertation Topic:

HYBRID SOLUTION FOR SCALABLE RESEARCH ARTICLES RECOMMENDATION

CANDIDATE: Nuhi Besimi, MSc
MENTOR: Prof. Dr. Betim Çiço

October, 2020




Abstract

In recent decades, machine learning has become a crucial factor in automating business operations and assisting in the decision-making process. The massive volume of data generated at an unprecedented rate has motivated researchers and industry analysts to continually develop effective and efficient analytical models and machine learning techniques.

In text mining, clustering and classification are essential techniques for extracting information from textual data: they allow us to identify groups of similar textual documents or to build classification models based on document similarity. Applying machine learning to textual data has become essential for extracting useful, previously unknown information from documents. The massive volume of unstructured data generated on the Web has motivated researchers and industry to develop efficient techniques for structuring and processing such data to extract meaningful information.
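Both clustering and classification of documents rest on a notion of textual similarity. As a minimal, illustrative sketch (assuming simple bag-of-words term counts, not the thesis's actual text representation), two documents can be compared by the cosine of their term-frequency vectors:

```python
import math
from collections import Counter

def vectorize(text):
    # Bag-of-words: map each term to its frequency in the document.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse term-frequency vectors.
    if not a or not b:
        return 0.0
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = lambda v: math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b))

d1 = vectorize("deep learning for text classification")
d2 = vectorize("text classification with deep networks")
d3 = vectorize("energy consumption of cloud servers")
print(round(cosine(d1, d2), 2), round(cosine(d1, d3), 2))  # → 0.6 0.0
```

Documents sharing vocabulary score close to 1, unrelated ones close to 0; clustering groups documents with high mutual similarity, while classifiers assign a new document to the class whose members (or centroid) it is most similar to.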

In this thesis, we present a hybrid model based on clustering and classification techniques to recommend research articles to researchers. Since the literature review process is time-consuming, we aim to automate it and recommend the most relevant research articles based on users' research field preferences. All information extracted from raw, unstructured research articles is represented in a graph structure, which is a suitable form for the recommendation process.

This research contributes to the machine learning community by evaluating some of the most significant text mining techniques for unsupervised and supervised learning, with the aim of easing the literature review process for researchers. Furthermore, it evaluates the accuracy and execution time of all model phases by comparing multiple techniques, and it compares the execution of the model in terms of cost and energy consumption in three different environments: a cloud instance, cloud functions, and distributed Raspberry Pis.

Results showed that our proposed model can ease the processing of literature reviews and help identify trending topics for a given field. Overall, we found that both unsupervised and supervised learning achieved promising accuracy when working with textual data. On the other hand, the solution's execution time degrades as the volume of data increases.

Our results also showed that distributed Raspberry Pis can have a highly positive impact in terms of lowering costs and energy consumption. Overall, we found that machine learning algorithms can be adapted to run on distributed Raspberry Pis at low cost and with low energy consumption compared to cloud alternatives. On the other hand, this solution does not offer high scalability, and it requires more time for management, deployment, and configuration.

A graph structure for representing the information extracted by machine learning techniques is one of the most suitable forms for machine learning tasks and recommendation systems: it allows us to query the data easily, represent all relationships explicitly, and achieve performance and scalability in the recommendation process. Other data structures showed poorer performance and greater complexity when storing the extracted information and producing recommendations.

Keywords: recommendation system, supervised learning, unsupervised learning, text mining, graph databases
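The hybrid pipeline summarized above ends with a graph representation queried for recommendations. A minimal sketch of that final step, using hypothetical paper identifiers and field labels (all names here are illustrative, not the thesis's actual data): papers are linked to field nodes, and recommending for a user means collecting the papers attached to the user's preferred field.

```python
from collections import defaultdict

# Hypothetical output of the clustering/classification phases:
# each paper identifier mapped to its predicted research field.
papers = {
    "p1": "natural language processing",
    "p2": "natural language processing",
    "p3": "computer vision",
    "p4": "medical",
}

# Graph representation (sketch): each field node holds edges to its papers.
graph = defaultdict(set)
for paper, field in papers.items():
    graph[field].add(paper)

def recommend(field, exclude=()):
    """Recommend papers attached to the user's preferred field node."""
    return sorted(graph[field] - set(exclude))

print(recommend("natural language processing"))  # → ['p1', 'p2']
```

In the actual system a graph database such as Neo4j plays this role, so the same lookup becomes a pattern-matching query over paper and field nodes rather than an in-memory dictionary.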


Contents

1. INTRODUCTION ..................................................................................................................... 16

    1.1 Problem description ............................................................................................................ 19

    1.2 Hypothesis ........................................................................................................................... 22

    1.3 Research Questions ............................................................................................................. 22

    1.4 Methodology ....................................................................................................................... 22

    1.5 Thesis Structure ................................................................................................................... 24

    1.6 Conclusion ........................................................................................................................... 25

    2. LITERATURE REVIEW ................................................................................................................. 27

    2.1 Document Clustering ........................................................................................................... 32

    2.1.1 K-means Clustering ....................................................................................................... 33

    2.1.2 K-Means++ Clustering ................................................................................................... 34

    2.1.3 K-Medoids Clustering ................................................................................................... 35

    2.1.4 Hierarchical Clustering .................................................................................................. 38

    2.1.5 Text representation formats ........................................................................................ 39

    2.2 Supervised Learning ............................................................................................................ 41

    2.2.1 k-NN Classifier ............................................................................................................... 45

    2.2.2 Centroid Classifier ......................................................................................................... 46

    2.2.3 Naive Bayes ................................................................................................................... 47

    2.2.4 SVM – Support Vector Machine ................................................................................... 49

    2.2.5 Convolutional Neural Network ..................................................................................... 50

    2.3 Hadoop ................................................................................................................................ 51

    2.3.1 Hadoop Architecture .................................................................................................... 53

    2.3.2 Map Reduce and Spark ................................................................................................. 55

    2.3.3 Real-time Data Stream Processing: Spark Streaming ................................................... 56

    2.4 Storage Systems .................................................................................................................. 56

    2.4.1 Hadoop vs. Relational Database Management Systems .............................................. 57

    2.4.2 Hadoop vs. Data Warehouse ........................................................................................ 59

    2.4.3 Cloud Solutions ............................................................................................................. 61


    2.4.4 Graph Databases .......................................................................................................... 62

    2.5 Conclusion ........................................................................................................................... 64

    3. METHODOLOGY ........................................................................................................................ 65

    3.1 Proposed Model .................................................................................................................. 68

    3.2 Phase 1 – Initial text clustering ........................................................................................... 69

    3.2.1 What is the right number of clusters? .......................................................................... 71

    3.2.2 Outlier Clusters ............................................................................................................. 73

    3.3 Phase 2 – A supervised learning model .............................................................................. 74

    3.4 Phase 3 – Graph representation and topic modeling ......................................................... 76

    3.5. Proposed Distributed Model .............................................................................................. 78

    3.6 Text Pre-Processing ............................................................................................................. 81

    3.7 Datasets ............................................................................................................................... 81

    3.8 Experimental setup ............................................................................................................. 83

    3.9 Proof of concept – Unsupervised Learning ......................................................................... 84

    3.10 Proof of concept – Supervised Learning ........................................................................... 91

    3.11 Conclusion ......................................................................................................................... 94

    4. EXPERIMENTS AND RESULTS .................................................................................................... 96

    4.1 Results Phase 1 – Text Clustering ........................................................................................ 96

    4.2 Results Phase 2 – Supervised Learning ............................................................................. 105

    4.3 Results Phase 3 – Graph representation and topic modeling ........................................... 118

    4.4 Experiments in cost and energy consumption .................................................................. 123

    4.5 Conclusion ......................................................................................................................... 127

    5. DISCUSSION OF FINDINGS....................................................................................................... 129

    5.1 Evaluation of Machine Learning Techniques .................................................................... 129

    5.2 Cost and Energy Consumption .......................................................................................... 132

    5.3 Text Pre-Processing ........................................................................................................... 133

    5.4 Findings on Data Storage .................................................................................................. 134

    5.5 Limitations ......................................................................................................................... 136

    5.6 Conclusion ......................................................................................................................... 136


    6. CONCLUSION ........................................................................................................................... 138

    PUBLICATIONS AND PRESENTATIONS......................................................................................... 143

    ACKNOWLEDGEMENT ................................................................................................................. 144

    REFERENCES ................................................................................................................................ 145

    APPENDIX A ................................................................................................................................. 160

    APPENDIX B ................................................................................................................................. 165

    APPENDIX C ................................................................................................................................. 169

    APPENDIX D ................................................................................................................................. 225


    List of Figures

    Figure 1. Overall Architecture ....................................................................................................... 21

    Figure 2. Paper distribution by year ............................................................................................. 29

    Figure 3. Paper classification ........................................................................................................ 29

Figure 4. K-Means Clustering Process ............................................................................ 34

    Figure 5. K-Means vs. K-Medoids Algorithms ............................................................................... 35

    Figure 6. K-Medoids Process ........................................................................................................ 37

    Figure 7. Hierarchical Clustering ................................................................................................... 38

Figure 8. Agglomerative hierarchical clustering, a bottom-up approach (left), and divisive hierarchical clustering, a top-down approach (right) ....................................................................... 39

    Figure 9. Supervised Learning process.......................................................................................... 42

    Figure 10. Decision tree (example) [Data mining Concepts and Techniques] .............................. 44

    Figure 11. k-NN example ............................................................................................................... 46

    Figure 12. Support Vector Machine Margin ................................................................................. 49

    Figure 13. Convolutional Neural Network .................................................................................... 51

    Figure 14. Hadoop Core Components .......................................................................................... 53

    Figure 15. Hadoop Architecture.................................................................................................... 54

    Figure 16. Hadoop Cluster ........................................................................................................... 54

    Figure 17. Spark Streaming ........................................................................................................... 56

    Figure 18. Data Warehouse Architecture ..................................................................................... 60

    Figure 19. Neo4j ............................................................................................................................ 63

    Figure 20. Proposed Model ........................................................................................................... 63

Figure 21-a. Phase 1 ....................................................................................................... 68

    Figure 21-b. Labeling Clusters ....................................................................................................... 70

    Figure 22. Vector representation of textual data ......................................................................... 73

    Figure 23. Phase 2 ......................................................................................................................... 75

    Figure 24. Graph Structure ........................................................................................................... 76

Figure 25. Phase 3 Identifying trending topics ..................................................................... 78

    Figure 26. Typical Master-Slave Architecture ............................................................................... 79

    Figure 27. Master-Slave Architecture model with Raspberry PIs ................................................. 80

    Figure 28. Phase 1 Clustering Accuracy ........................................................................................ 99

    Figure 29. Execution Time in seconds ......................................................................................... 101

    Figure 30. Visualization, Top Generated Clusters ....................................................................... 104

    Figure 31. Phase 2 Supervised Learning Accuracy ...................................................................... 107

    Figure 32. Phase 2 Execution Time in seconds ........................................................................... 109

Figure 33. Natural Language Processing Graph. Group of papers which all belong to a specific field in NLP .................................................................................................................................. 119


    Figure 34. Medical Graph. ........................................................................................................... 120

    Figure 35. Medical and the bridge papers with other fields ...................................................... 121

    Figure 36. Natural Language processing and the bridge papers with other fields..................... 121

    Figure 37. Computer Vision and the bridge papers with other fields ........................................ 122

    Figure 38. Playing Games and the bridge papers with other fields ............................................ 122

    Figure 39. Cost Comparison for 1 year ....................................................................................... 123

    Figure 40. Cost Comparison for 1 year ....................................................................................... 125

Figure 41. Power consumption (Watt per hour) of physical servers with near 100% CPU utilization. Source: https://www.anandtech.com/show/7285/intel-xeon-e5-2600-v2-12-core-ivy-bridge-ep/11 ......................................................................................................................... 126


    List of Tables

    Table 1 Research articles by field ................................................................................................. 28

    Table 2. RDBMS vs. Hadoop .......................................................................................................... 58

Table 3. RDBMS vs. MapReduce ..................................................................................... 58

    Table 4. Data Warehouse vs Hadoop [71] .................................................................................... 60

    Table 5. Dataset organization ....................................................................................................... 82

    Table 6. k-NN accuracy (3 classes) ................................................................................................ 86

    Table 7. k-NN accuracy (5 classes) ................................................................................................ 87

Table 8. k-NN accuracy (3 classes, keywords only) ........................................................................ 88

Table 9. k-NN accuracy (5 classes, keywords only) ........................................................................ 88

    Table 10. News articles – Experiment 1 ........................................................................................ 91

    Table 11. News articles – Experiment 2 ........................................................................................ 91

    Table 12. Testing the accuracy of classifiers ................................................................................. 91

Table 13. Classify Politics News Articles (Total news articles: 49) ................................................. 92

    Table 14. Classify Technology News Articles (Total news articles: 86) ......................................... 92

    Table 15. Classify Sports News Articles (Total news articles: 102) ............................................... 92

    Table 16. Execution time (in seconds) .......................................................................................... 93

    Table 17. Phase 1 Unsupervised Learning Accuracy ..................................................................... 98

    Table 18. Efficiency of Silhouette Coefficient (input: 7 clusters) .................................................. 99

    Table 19. Phase 1 Unsupervised Learning Execution Time in seconds ...................................... 100

    Table 20. Generated clusters from Dataset 1 ............................................................................. 102

    Table 21. Top Generated Clusters from Dataset 1 ..................................................................... 102

    Table 22. Phase 2 Supervised Learning Accuracy ....................................................................... 105

    Table 23. Phase 2 Supervised Learning Average Accuracy ......................................................... 108

Table 24. Naive Bayes - 7 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing, 5 - Playing Games, 6 - Speech ............ 110

Table 25. Naive Bayes (7 classes) - Classification Report. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing, 5 - Playing Games, 6 - Speech ............ 110

Table 26. SVM - 7 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing, 5 - Playing Games, 6 - Speech ............ 111

Table 27. SVM (7 classes) - Classification Report. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing, 5 - Playing Games, 6 - Speech ............ 112


Table 28. Logistic Regression - 7 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing, 5 - Playing Games, 6 - Speech ............ 112

Table 29. Logistic Regression (7 classes) - Classification Report. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing, 5 - Playing Games, 6 - Speech ............ 113

Table 30. Decision Tree - 7 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing, 5 - Playing Games, 6 - Speech ............ 114

Table 31. Decision Tree (7 classes) - Classification Report. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing, 5 - Playing Games, 6 - Speech ............ 114

Table 32. KNN - 7 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing, 5 - Playing Games, 6 - Speech ............ 115

Table 33. KNN (7 classes) - Classification Report. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing, 5 - Playing Games, 6 - Speech ............ 116

Table 34. Random Forest - 7 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing, 5 - Playing Games, 6 - Speech ............ 117

Table 35. Random Forest (7 classes) - Classification Report. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing, 5 - Playing Games, 6 - Speech ............ 117

    Table 36. PRO attributes for different execution platforms ....................................................... 126

    Table 37. Comparison of environments ..................................................................................... 133

    Table 38. Comparison of OLAP and OLTP ................................................................................... 135

Table 39. Random Forest - 2 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical ............ 169

Table 40. Random Forest (2 classes) - Classification Report. 0 - Computer Vision, 1 - Medical ............ 169

Table 41. Random Forest - 3 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical, 2 - Methodology ............ 169

Table 42. Random Forest (3 classes) - Classification Report. 0 - Computer Vision, 1 - Medical, 2 - Methodology ............ 170

Table 43. Random Forest - 4 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Natural Language Processing ............ 170

Table 44. Random Forest (4 classes) - Classification Report. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Natural Language Processing ............ 171

Table 45. Random Forest - 5 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing ............ 171


Table 46. Random Forest (5 classes) - Classification Report. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing ............ 172

Table 47. Random Forest - 6 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing, 5 - Playing Games ............ 172

Table 48. Random Forest (6 classes) - Classification Report. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing, 5 - Playing Games ............ 173

Table 49. KNN - 2 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical ............ 174

Table 50. KNN (2 classes) - Classification Report. 0 - Computer Vision, 1 - Medical ............ 174

Table 51. KNN - 3 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical, 2 - Methodology ............ 174

Table 52. KNN (3 classes) - Classification Report. 0 - Computer Vision, 1 - Medical, 2 - Methodology ............ 175

Table 53. KNN - 4 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Natural Language Processing ............ 175

Table 54. KNN (4 classes) - Classification Report. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Natural Language Processing ............ 176

Table 55. KNN - 5 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing ............ 176

Table 56. KNN (5 classes) - Classification Report. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing ............ 177

Table 57. KNN - 6 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing, 5 - Playing Games ............ 177

Table 58. KNN (6 classes) - Classification Report. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing, 5 - Playing Games ............ 178

Table 59. Decision Tree - 2 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical ............ 179

Table 60. Decision Tree (2 classes) - Classification Report. 0 - Computer Vision, 1 - Medical ............ 179

Table 61. Decision Tree - 3 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical, 2 - Methodology ............ 179

Table 62. Decision Tree (3 classes) - Classification Report. 0 - Computer Vision, 1 - Medical, 2 - Methodology ............ 180

Table 63. Decision Tree - 4 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Natural Language Processing ............ 180

Table 64. Decision Tree (4 classes) - Classification Report. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Natural Language Processing ............ 181

Table 65. Decision Tree - 5 classes Confusion Matrix. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing ............ 181

Table 66. Decision Tree (5 classes) - Classification Report. 0 - Computer Vision, 1 - Medical, 2 - Methodology, 3 - Miscellaneous, 4 - Natural Language Processing ............ 182


    Table 67. Decision Tree - 6 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –

    Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 182

    Table 68. Decision Tree (6 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –

    Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 183

    Table 69. Logistic Regression - 2 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical 184

    Table 70. Logistic Regression (2 classes) - Classification Report 0 - Computer Vision, 1 – Medical

    ..................................................................................................................................................... 184

    Table 71. Logistic Regression - 3 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 -

    Methodology ............................................................................................................................... 184

    Table 72. Logistic Regression (3 classes) - Classification Report 0 - Computer Vision, 1 – Medical,

    2 - Methodology .......................................................................................................................... 185

    Table 73. Logistic Regression - 4 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –

    Methodology, 3 – Natural Language Processing ........................................................................ 185

    Table 74. Logistic Regression (4 classes) - Classification Report 0 - Computer Vision, 1 – Medical,

    2 – Methodology, 3 – Natural Language Processing .................................................................. 186

    Table 75. Logistic Regression - 5 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –

    Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 186

    Table 76. Logistic Regression (5 classes) - Classification Report 0 - Computer Vision, 1 – Medical,

    2 – Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .................................... 187

    Table 77. Logistic Regression - 6 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –

    Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 187

    Table 78. Logistic Regression (6 classes) - Classification Report 0 - Computer Vision, 1 – Medical,

    2 – Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ..... 188

    Table 79. SVM - 2 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical ....................... 189

    Table 80. SVM (2 classes) - Classification Report 0 - Computer Vision, 1 – Medical .................. 189

    Table 81. SVM - 3 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 - Methodology

    ..................................................................................................................................................... 189

    Table 82. SVM (3 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 -

    Methodology ............................................................................................................................... 190

    Table 83. SVM - 4 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 – Methodology,

    3 – Natural Language Processing ................................................................................................ 190

    Table 84. SVM (4 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –

    Methodology, 3 – Natural Language Processing ........................................................................ 191

    Table 85. SVM - 5 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 – Methodology,

    3 - Miscellaneous, 4 – Natural Language Processing .................................................................. 191

    Table 86. SVM (5 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –

    Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 192


    Table 87. SVM - 6 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 – Methodology,

    3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games .................................. 192

    Table 88. SVM (6 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –

    Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 193

    Table 89. Naive Bayes - 3 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 -

    Methodology ............................................................................................................................... 194

    Table 90. Naive Bayes (3 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 -

    Methodology ............................................................................................................................... 194

    Table 91. Naive Bayes - 4 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –

    Methodology, 3 – Natural Language Processing ........................................................................ 195

    Table 92. Naive Bayes (4 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –

    Methodology, 3 – Natural Language Processing ........................................................................ 195

    Table 93. Naive Bayes - 5 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –

    Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 196

    Table 94. Naive Bayes (5 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –

    Methodology, 3 - Miscellaneous, 4 – Natural Language Processing .......................................... 196

    Table 95. Naive Bayes - 6 classes Confusion Matrix. 0 - Computer Vision, 1 – Medical, 2 –

    Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 197

    Table 96. Naive Bayes (6 classes) - Classification Report 0 - Computer Vision, 1 – Medical, 2 –

    Methodology, 3 - Miscellaneous, 4 – Natural Language Processing, 5 – Playing Games ........... 197


    List of Abbreviations

    SVM Support Vector Machine

    k-NN k - Nearest Neighbor

    TF Term Frequency

    TF-IDF Term Frequency Inverse Document Frequency

    LDA Latent Dirichlet Allocation

    NMF Non-negative Matrix Factorization

    LDA Linear Discriminant Analysis

    CNN Convolutional Neural Network

    NLP Natural Language Processing


    1. INTRODUCTION

    The growth of technologies and the continuous generation of data have posed unique

    challenges, especially to the data mining community. These challenges have motivated

    researchers and industry analysts to continually develop new tools and methods for improving

    the application of various machine-learning techniques [1] [2] [46]. The main goal is to identify

    patterns, build recommendation(s) systems, and predictive models that will eventually support

    an organization's decision-making process. The application of machine-learning techniques is

    broad and spans across different research areas [38] [62].

    Nowadays, data is considered as one of the most valuable assets that organizations and

    companies are willing to acquire. Vast volumes of data are being captured to gain a better

    insight into business processes, operations, products, customers, and more [25].

Because the volume of unstructured data is growing rapidly, many enterprises also turn to

    technological solutions to better manage and store their unstructured data. These can include

    hardware or software solutions that enable them to make the most efficient use of their

    available storage space [1] [3] [5] [44].

In machine-learning tasks, supervised learning methods are essential because they allow us to make predictions. Supervised learning is also known as classification [9] [35]. Unsupervised learning methods are also very commonly used in data mining. Their primary purpose is to discover groups (clusters) of similar data, where elements in the same group are very similar and differ from elements of other groups. Cluster analysis has been widely used in many applications such as business intelligence, image pattern recognition, web search, biology, and security. It is also used to improve recommendation systems. Similarly, search engines use clustering; cluster mechanisms improve the quality and the speed of a search [6] [26] [28].
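For instance, grouping similar textual documents is typically done by converting each document into a numeric vector and clustering the vectors. A minimal sketch with scikit-learn on a toy corpus (an illustrative toolchain, not the setup evaluated in this thesis):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Toy corpus with two themes: computer vision and language processing.
docs = [
    "image recognition with deep networks",
    "deep networks for image classification",
    "language models for text translation",
    "text translation with language models",
]

# Represent each document as a TF-IDF weighted term vector.
vectors = TfidfVectorizer().fit_transform(docs)

# Group the documents into two clusters of similar vectors.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
print(kmeans.labels_)  # one cluster id per document
```

Documents that share many weighted terms land in the same cluster, which is the property search engines and recommendation systems exploit.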

    Statistical and analytical algorithms have recently shown favorable results in working with

    structured data. However, analyzing semi-structured and unstructured data is not a


    straightforward task. Most proposed solutions are ad-hoc solutions that are applied to specific

    problems [10] [41] [54].

Text mining is one of the most challenging areas in many machine-learning applications, mainly because of the nature of the data. Textual data is unstructured and, as such, requires additional pre-processing steps [117]. Two of the most important measurements when applying machine-learning techniques, especially when dealing with unstructured textual data, are accuracy and performance. Accuracy issues emerge as a result of the variety of the data, and performance issues arise due to its enormous volume. To tackle the issues mentioned above, one must establish a well-defined strategy to store and process "Big Data" [62].

The two significant challenges in the world of big data are 1) storing the data and 2) processing the data. Since more and more data are available on the Web, applying statistical and analytical algorithms to these data is an important topic nowadays. Processing unstructured and massive amounts of data usually requires more in-depth analysis and more pre-processing tasks than traditional data-mining techniques applied to regular datasets. Traditional techniques/algorithms have limitations in performance and accuracy when working with massive amounts of data; therefore, there is a need for new programming models/paradigms to overcome these challenges [129]. MapReduce, Storm, and Spark are the most common frameworks used nowadays for processing big data. The main advantage of these programming models is that they can process the data in parallel in a distributed environment [120].
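The map/reduce idea behind these frameworks can be sketched in plain Python: the map step processes each document independently (and can therefore run on separate workers), and the reduce step merges the partial results by key. A toy word count, not the Hadoop or Spark API:

```python
from collections import Counter

def mapper(doc):
    # Map step: each document is processed independently, so these
    # calls can run in parallel on different workers.
    return Counter(doc.lower().split())

def reducer(partials):
    # Reduce step: merge the per-document word counts by key (the word).
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

docs = [
    "big data needs parallel processing",
    "parallel processing of big data",
]
totals = reducer(map(mapper, docs))
print(totals["big"], totals["parallel"])
```

In a real framework the mapper outputs would be shuffled across the network and reduced on many machines; the programming model, however, is the same.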

Machine-learning algorithms, both supervised and unsupervised, cannot easily be integrated into distributed environments to process data in parallel. Additional changes to the algorithms are required to make them suitable for processing big data. First, the structure of the data differs from that of traditional datasets, and second, not every sequential algorithm can be easily adapted and transformed to work in parallel [23].

    Various models and techniques have been proposed for overcoming these two challenges,

    which include distributed environments for parallel processing, cloud platforms, high


computing resources, GPU processing, etc. [1] [3] [122]. However, adding more resources is not always the best or preferred solution in these scenarios. On the one hand, distributed environments for parallel processing are widely used and arguably the most suitable for Big Data [3] [5]; however, any custom distributed environment comes with the cost of configuring it and the knowledge needed to prepare it. On the other hand, cloud solutions have been among the most attractive, since they allow for processing, for distributed data storage, and for integrating data from legacy systems. Nevertheless, the cost of the cloud is the main drawback and the most significant limitation for researchers and for small and medium-sized companies that cannot afford it [1].

Recommendation systems are becoming more and more important nowadays in helping people solve different tasks. They are continuously used in computer science, finance,

    medicine, sports, and many other areas, for automating or semi-automating various tasks. They

    are built on top of different data sets and data structures, starting from various databases,

    textual data, Web, and streaming data from different sensors [51] [67] [76]. These systems can

    also help researchers find the most relevant research articles for their research fields.

    Therefore, in this thesis, we propose and evaluate a hybrid model for recommending research

    articles built using both supervised and unsupervised machine-learning techniques.

    Extracting information and knowledge from unstructured data is becoming more and more

important nowadays, since there is potential for utilizing machine-learning techniques to organize and extract details from textual data [10].

Improving the process of literature review is crucial; therefore, we target this problem using machine-learning techniques. Solving it is important nowadays for helping new researchers produce more qualitative and novel research topics. One of the most important aspects is making sure that researchers can explore research fields more easily and check the recent activities and topics of each field. As a result, they need the ability to check the dependencies and relationships between various research fields in recent years.


    1.1 Problem description

    Researchers spend too much time reading others' work and finding research questions [143].

    This process requires lots of effort in reading and classifying relevant papers. The process of

    literature review is a challenging task for new researchers in different fields of study. Through

the proposal in this thesis, we try to ease the literature review process and shorten the time needed to define the research problem.

    By analyzing the possible research gaps, we define a problem that we target.

    Can we automate the process of literature review?

Our study aims to collect/retrieve and analyze research/scientific articles by applying machine-learning techniques to recommend research articles and/or research gaps to researchers based on their research fields. By research articles, we mean scientific articles of any field, where we take into consideration the following attributes:

• Title – the title of the research article plays an important role in the analysis process,

    • Author/s – it is important to know the author/s of the research article,

    • Year of publication,

    • Abstract,

    • Keywords,

    • Content,

    • Contribution,

    • Results,

    • Future work,

    • Conference/Journal,

    • Related articles,

    • Bibliography.
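For illustration only, these attributes could be modeled as a simple record type; the field names below are our own hypothetical choices, not a prescribed schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ResearchArticle:
    """One record in an article dataset, mirroring the attributes listed above."""
    title: str
    authors: List[str]
    year: int
    abstract: str
    keywords: List[str] = field(default_factory=list)
    content: str = ""
    contribution: str = ""
    results: str = ""
    future_work: str = ""
    venue: str = ""  # conference or journal
    related_articles: List[str] = field(default_factory=list)
    bibliography: List[str] = field(default_factory=list)

article = ResearchArticle(
    title="Hybrid Solution for Scalable Research Articles Recommendation",
    authors=["N. Besimi"],
    year=2020,
    abstract="A hybrid model for recommending research articles.",
)
print(article.title)
```

A uniform record like this is what makes it possible to feed heterogeneous digital-library sources into one analysis pipeline.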

We propose a hybrid model, built on top of a large dataset of research articles, for recommending research articles to researchers. This approach will use input parameters such as an abstract, a list of keywords, research articles, or research field/s. In Figure 1, we present the overall architecture of our research experiment.

    We aim to automate and/or semi-automate the process of literature review using machine-

learning algorithms. Nowadays, researchers, Master's students, and Ph.D. candidates spend too much

    time identifying the research fields they want to target; in addition, they spend too much time

    investigating the research gaps and possible future work on a field. Therefore, our main

    contribution is to build a hybrid model based on machine-learning algorithms (supervised and

    unsupervised learning).

We will also focus on integrating various supervised and unsupervised learning techniques

    into our model. We will evaluate the techniques and extract the most efficient ones.

    Our aim is also to generate a scalable and updatable training set by using any textual dataset.

    The training set can be used to recommend research articles or apply other analysis like feature

    extraction, classification of new research articles, and a correlation between different research

    fields.


    Figure 1. Overall Architecture

    The first part is integrating digital research articles from various digital libraries. As a result of

    the first phase, we will have a centralized database that can be used to construct a

    recommendation system. The second phase is to build a model based on unsupervised and

    supervised learning algorithms, which, as an output, will generate a training dataset (model)

    that will be used to recommend research articles. The third phase will analyze the quality of the

model; the focus will be on the model's ability to update itself and its scalability. By scalability,

    we mean the distribution of our training set as the dataset is increasing in the future. The fourth

    phase will analyze the models for storing our training set to have an efficient recommendation

system. Finally, researchers will test and evaluate the model based on their queries.


    1.2 Hypothesis

    1. The literature review process can be simplified by using a hybrid model based on text-

    mining algorithms.

    1.3 Research Questions

    1. Which unsupervised learning algorithms provide the most efficient results on large data

    sets?

2. Which data pre-processing techniques provide higher accuracy for unsupervised and

    supervised learning on textual documents?

    3. How to generate an efficient training set by using machine-learning algorithms on

    textual documents?

    4. How to construct an updatable model based on unsupervised and supervised learning

    algorithms? Which is the best data storage for machine-learning tasks?

    5. What are the different Supervised Learning Techniques for textual documents? Which

    techniques are the most efficient for our research?
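As background for the second question, typical pre-processing of textual documents includes lowercasing, tokenization, stop-word removal, and stemming. A minimal pure-Python sketch (the suffix stripping is a toy stand-in for a real stemmer such as Porter's, and the stop-word list is abridged):

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "of", "and", "to", "in", "on"}

def preprocess(text):
    # Lowercase and tokenize on runs of letters.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Drop stop words, which carry little topical information.
    tokens = [t for t in tokens if t not in STOP_WORDS]
    # Crude plural stripping as a stand-in for real stemming.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]

print(preprocess("The clustering of textual documents in large datasets"))
```

The choice and order of such steps is exactly what this research question evaluates, since they directly affect downstream accuracy.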

    1.4 Methodology

Our research addresses one defined hypothesis and five research questions. We will

    analyze whether by using machine-learning techniques, we can automate/semi-automate the

    process of literature review. As a result, we propose an efficient hybrid model on top of which

    we will apply our experiments. Finally, we are planning a case study to evaluate the model,

    which will be constructed on a specific data set.

    The process of our literature review is an example of our qualitative research. For data

    gathering, we have analyzed others' work on text mining, data mining, and big data. As a result,

    we have extracted possible research gaps in the field and identified the research problem we

    aim to solve.


    Experiments and the generated empirical results will be the base for our quantitative research

    methodology. We will also provide comparisons between various techniques (supervised and

    unsupervised learning techniques) on efficiency.

    A suitable plan of action must be established and carried out to resolve the problem. This

    section introduces the chosen research strategy and specific scientific techniques for data

    collection. Subsequently, the application of the method and research ethics has been

    described. This research, in terms of strategy, follows the experimental approach. ("An

    experiment is an empirical investigation under controlled conditions designed to examine the

    properties of, and relationships between, specific factors" [7])

    Our focus is on supervised and unsupervised learning algorithms because we see a potential for

    building a hybrid solution with a combination of classification and clustering algorithms to build

    a highly accurate model for textual data. Our study focuses on highly accurate data-mining

techniques for textual data. We also analyze the need for highly efficient recommendation

    systems and the limitations of algorithms on big datasets.

    The goal of this thesis is to evaluate the proposed model for recommending research articles. It

    will present the overall accuracy and the accuracy of the individual steps. Multiple supervised

    and unsupervised techniques have been taken into consideration to evaluate and identify the

    most efficient techniques for this type of system. Our primary hypothesis is that we can ease

    the literature review process for researchers and identify trend topics for a given field.

    Our solution is based on machine-learning techniques, and it contains 3 (three) phases:

    1. Generating a group of relevant articles from raw input of research articles. By applying

    unsupervised learning techniques, the aim is to generate “n” number of clusters where

we have similar articles in each cluster. The outputs of this phase are the clusters,

    distance between clusters, centroid, outlier clusters, cluster labels, and extracted

    keywords for each cluster.

    2. Using the clusters generated from Phase 1 to build a model based on supervised

    learning techniques, which is used to add more articles in the future and recommend


    research articles. Various machine-learning techniques have been considered in this

    phase; the results from the experimentations have been presented in Chapter 5.

    3. Using topic modeling to extract trend topics and keywords for specific fields (clusters).

    All outcomes, i.e. the information extracted from all three phases, have been presented in a

    graph structure, which is used as the basis of our recommendation system and the extraction of

    trend topics.

    To evaluate existing techniques and propose the best approach, we have tested three

    unsupervised learning techniques for Phase 1, and 15 supervised learning techniques for Phase

    2 on two different datasets. For this purpose, we conducted five studies:

    1. Evaluation of the accuracy and execution time of unsupervised learning techniques,

    which represents Phase 1 of our model.

    2. Evaluation of supervised learning techniques, which represents Phase 2 of our model.

    3. Evaluation of the process of identifying trend topics for specific fields of topic modeling.

    4. Comparison of cost and energy consumption of our model running in three different

environments: 1) a cloud instance, 2) cloud functions, and 3) distributed Raspberry Pis.

5. Evaluation of the different data sources that can be used to organize the extracted information

    and build efficient recommendation systems.

    1.5 Thesis Structure

    Chapter 2 of the thesis presents the background and state of the art in the field. In this chapter,

we initially present the list of references used for the state of the art. The chapter continues with an overview of various machine-learning techniques. We present a survey of unsupervised learning techniques, their different types, and their characteristics. It continues with an overview of

    supervised learning techniques and their application on textual data. In addition, state of the art

    for various data representation models for textual data has been presented. In the last part of

Chapter 2, we present the benefits of Hadoop in Big Data and different data storage


    models, starting from traditional ones, continuing with distributed data storage, and finally, an

    overview of cloud data platforms and graph structures.

In Chapter 3, the proposed solution is presented. The hybrid model for recommending research articles, all three phases, and the data representation are described in

    detail. Section 3.4 presents an enhanced version of the proposed model that can run in

    distributed Raspberry PIs. Section 3.5 shows all the pre-defined pre-processing steps used for

    the textual data for our experiments. In Section 3.6, we have the list of Datasets used for

    different phases of evaluation of the model, and finally, in Section 3.7, the experimental setup

    and technical details have been presented.

    All experiments and results have been presented in Chapter 4. It contains all the detailed results

    for all the 5 (five) conducted studies of this thesis. Initially, it presents results for proof of

    concepts for unsupervised learning and supervised learning techniques. Then it continues with

    the results of Phase 1 and Phase 2 of the proposed model. In Section 4.4, the results for

    comparison and experiments in cost and energy consumption have been presented. In the last

    part, the outcomes of topic modeling and graph representation of the extracted information

    have been presented.

Chapter 5 presents the discussions on findings derived from the experimentation in

    Chapter 4. In this chapter, we also present comparisons of our results with other similar studies

    and overall findings for different phases, environments, and experimentation setups.

    Chapter 6, Conclusion, is the last section of this study. An overview of all our work in this Ph.D.

    dissertation has been presented along with our plans for future activities.

    1.6 Conclusion

    In this chapter, we stated the research problem, which emerged as a result of the literature

    review presented in Chapter 2. We saw that research articles can be considered as

    unstructured data that can be represented in textual documents. In addition, new researchers,

    Master and Ph.D. students spend too much time defining their research fields and their theses.


    There is a potential of easing the process of literature review by utilizing machine-learning

    techniques on textual documents.

    In Section 1.2, the central hypothesis was stated, along with five research questions. The

central hypothesis asks whether the literature review process can be simplified using machine-learning techniques on textual data. In addition, the research questions target various issues on

    the efficiency of current machine-learning techniques for textual data.

    In Section 1.3, we deal with the importance and benefits of solving this problem. It will have a

    good impact on improving the quality of research work for young researchers. The

    methodology for targeting this research problem is by using experimental and empirical results.

As a result, a hybrid model will be proposed based on machine-learning techniques. The goal is

to extract information from unstructured data, in our case research articles, and to store the

    extracted information in a graph structure. Finally, multiple supervised and unsupervised

    learning techniques will be evaluated.


    2. LITERATURE REVIEW

    Data on the Web is increasing rapidly. More and more data in different formats is generated.

This leads us to a new concept, namely the big data concept. Each time we have a big data task

    there are two main concerns:

    • Storing a massive amount of data (big data)

    • Processing large datasets

All data comes from sources such as external data sources (mobile devices, social media, IoT, and media) or internal data sources (transactions, log data, e-mails, etc.). In total, less than 0.5% of all data is ever analyzed and used; therefore, analyzing data and applying machine learning and statistical analysis to it is very important nowadays [57].

In machine-learning tasks, supervised learning methods are essential because they allow us to make predictions. Supervised learning is also known as "classification." Classification algorithms consist of two parts. The first part is the learning part, where the model is constructed based on training data. The second part is the classification part, where the model is used for prediction. Many classification algorithms have been proposed. The most popular are Random Forest, Logistic Regression, Decision Tree Induction, Bayesian classifiers, Neural Networks, Support Vector Machines, etc. Every algorithm has its pros and cons, its application fields, and the types of data it is applied to [4] [8] [18] [19].
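These two parts map directly onto the common fit/predict convention; a minimal sketch with a multinomial Naïve Bayes text classifier (scikit-learn and toy data, for illustration only):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

train_texts = [
    "patient treatment and clinical trial",
    "hospital diagnosis of the patient",
    "image pixels and edge detection",
    "object detection in camera images",
]
train_labels = ["medical", "medical", "vision", "vision"]

# Part 1 (learning): build the model from the training data.
vectorizer = CountVectorizer()
model = MultinomialNB().fit(vectorizer.fit_transform(train_texts), train_labels)

# Part 2 (classification): use the model for prediction on unseen text.
prediction = model.predict(vectorizer.transform(["clinical diagnosis of a patient"]))
print(prediction[0])
```

The same two-part structure holds for the other classifiers named above; only the model constructed in the learning part differs.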

Unsupervised learning methods are also very commonly used in data mining. Their primary purpose is to discover groups (clusters) of similar data, where elements in the same group are very similar, and they differ from elements of the other groups. Cluster analysis has been widely used in many applications such as business intelligence, image pattern recognition, web search, biology, and security. It is also used to improve recommendation systems. In addition, search engines use clustering; cluster mechanisms improve the quality and the speed of a search. Outlier detection is another application of clustering: by finding elements that do not belong to any cluster (group), we detect the outliers [6] [31] [52] [53].
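Outlier detection through clustering can be illustrated with a density-based algorithm such as DBSCAN, which assigns the label -1 to points that belong to no cluster; a toy numeric sketch, with scikit-learn assumed:

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Two dense groups plus one far-away point that belongs to neither.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                   [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
                   [20.0, 20.0]])

# Points with too few neighbors within eps are labeled -1 (noise).
labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(points)
print(labels)
```

The same idea transfers to textual data once documents are vectorized: documents assigned to no cluster are the outliers.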


    The process of the literature review is based on six research questions:

    1. Which are the most crucial text-mining techniques for text classification and text

    clustering?

    This question identifies the state of the art in text mining algorithms and techniques and the

    most recent algorithms proposed.

    2. What data transformation models are used for textual data?

    This question is used to determine the impact of the models on the efficiency of the algorithms

    considering accuracy.

    3. What are the limitations of high-performance text-mining algorithms on big datasets?

This question provides information about the limitations of using traditional approaches on big

    datasets.

    4. Which are the proposed models for high-performance data-mining algorithms?

    This question provides the list of proposed solutions for high-performance data-mining

    algorithms on big datasets.

    5. What contributions have been proposed for research article analysis?

    This question shows whether there are proposed solutions or related work.

    6. What is the importance of high-performance text-mining algorithms nowadays?

    This question identifies the trends related to our research field.

    To answer and conduct an overview of the six research questions, we have manually analyzed

    research articles from the IEEE-Xplore and ACM digital libraries. In total, we have analyzed more

    than 130 relevant research articles. The papers are from 1994 to 2019. Furthermore, the

    distribution of papers based on the primary research field is as follows:

Table 1. Research articles by field

Text Mining: 50
Big Data: 23
Data Mining: 45
Parallel/Distributed Programming Models: 15


    Figure 2. Paper distribution by year

    The classification of the research articles is as follows:

    Figure 3. Paper classification

    [Figure 3 classifies the analyzed papers along four dimensions: data type (text, log, relational DB, web, social media); algorithms (classification, clustering, pattern mining, information retrieval, term frequency, TF-IDF, enhanced TF-IDF, Word2Vec); paradigm (traditional, parallel-distributed programming); and contribution (model, method, review, survey).]

    In the first and third research questions of the literature review, we identify the state of the art in text-mining algorithms and techniques and the most recent algorithms proposed. While data mining is a broad field, our focus was only on the algorithms and techniques applied to textual data sets. For this part, we have included two groups of techniques: 1) classification and 2) clustering.

    From the papers, we can see that there are various techniques proposed for text clustering and

    text classification. In most of the papers, we see that the focus is on improving the algorithms'

    accuracy. The proposed classification algorithms are from two groups: 1) lazy learners and 2)

    eager learners.

    Different versions of probabilistic algorithms, such as the Naïve Bayes algorithm, are widely used [64] [130]. Naïve Bayes has proven to be an accurate algorithm when the training set is large enough. SVM is another proposed solution for textual data classification. Various researchers present SVM as one of the most accurate algorithms for text data classification. Its main drawback is that its accuracy decreases as the number of classes increases [4] [30] [130]. Hybrid solutions, which combine two or more classification algorithms, have also been proposed.

    The two algorithms, k-NN and the Centroid Classifier, have been shown to achieve high classification accuracy in classifying text documents [19]. When the number of documents is not very high, these algorithms also perform very well in terms of execution time. However, as the amount of input data in the training set increases, the execution time increases too. Since the data on the Web keeps growing, traditional algorithms are no longer sufficient, and we are required to process data using parallel/distributed programming models [130].

    Various techniques have been proposed for information retrieval and feature extraction on textual data [2] [11] [12]. One of the most fundamental techniques is term analysis of text documents. The Bag of Words representation and term frequency are often used when experimenting with textual documents, and both have shown good results in accuracy. Term Frequency - Inverse Document Frequency (TF-IDF) scores the importance of a word (term) in a textual document: a term scores highly when it appears frequently in that document but rarely across the rest of the corpus. It has been proven to increase the accuracy of supervised and unsupervised learning algorithms.
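As a small illustration of this scoring, here is a from-scratch sketch of ours; the function and the three-document toy corpus are invented for the example, and production systems would typically rely on a library implementation such as scikit-learn's TfidfVectorizer:

```python
import math

def tf_idf(term, document, corpus):
    """Score `term` in `document`: frequent in this document but rare
    across the corpus -> high score (raw-count TF, log IDF)."""
    tf = document.count(term) / len(document)
    docs_with_term = sum(1 for d in corpus if term in d)
    idf = math.log(len(corpus) / docs_with_term)
    return tf * idf

# Toy corpus of tokenized "documents".
corpus = [
    ["clustering", "groups", "documents"],
    ["classification", "labels", "documents"],
    ["clustering", "and", "classification"],
]

score_rare = tf_idf("groups", corpus[0], corpus)      # appears in 1 of 3 docs
score_common = tf_idf("documents", corpus[0], corpus)  # appears in 2 of 3 docs
```

The rarer term "groups" scores higher than the more widespread term "documents", which is exactly the discriminative behavior described above.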


    Word sequences, graph structures, and word embeddings (i.e., word2vec) are the most recent

    models proposed by researchers, which have shown promising results in a variety of applications and datasets [2] [65] [66]. These recent models open the door to more advanced machine-learning applications on textual datasets.

    Word2vec is a model that produces word embeddings. Such models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words. Word2vec takes a text corpus as input and generates a vector space, typically of several hundred dimensions, in which each unique word in the corpus is assigned a corresponding vector [12].

    Word2vec was created by Tomas Mikolov and his research team at Google. The Word2vec algorithm has many advantages compared to other algorithms, such as Latent Semantic Analysis. LSA (Latent Semantic Analysis) is a natural language processing technique for analyzing the relationships between textual documents and their terms by producing a set of concepts related to those documents and terms [11]. Word embedding is the collective name for feature-learning and language-modeling techniques in natural language processing (NLP) in which terms from the documents are transformed into vectors of real numbers.
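To make the vector intuition concrete, here is a toy sketch: the 2-D vectors below are invented for illustration (real trained embeddings have hundreds of dimensions), and the point is only that semantically related words end up close together under cosine similarity.

```python
import math

# Hand-crafted toy "embeddings" -- NOT trained vectors, just an illustration.
embeddings = {
    "king":  [0.9, 0.8],
    "queen": [0.85, 0.82],
    "apple": [0.1, 0.9],
}

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1.0 = same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

sim_related = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_unrelated = cosine_similarity(embeddings["king"], embeddings["apple"])
```

In a trained embedding space, this same measure is what lets text-similarity and clustering algorithms compare words (and, by aggregation, documents) numerically.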

    Based on our literature review process in our field of interest, we can see increasing interest among researchers in the application of machine-learning algorithms to big data sets. The focus is on:

    • accuracy of algorithms,

    • performance,

    • optimization of algorithms,

    • information retrieval and data storage.

    The quality of our literature review depends on two key factors: 1) the number of papers used in the process and 2) the construction of the research questions for the literature review. As a result of this process, we have identified several potential research gaps, which are possible topics for future work in our research field:

    1. Hybrid solutions for high accuracy text-clustering and text-classification algorithms.

    2. Improving text-similarity algorithms by extracting contextual meaning.

    3. Adapting traditional machine-learning algorithms to parallel/distributed programming

    models.

    4. High-performance text-mining algorithms, analysis, and best practices.

    5. Fully automated text classification systems based on machine-learning algorithms.

    6. Real-time machine-learning algorithms.

    2.1 Document Clustering

    Managing and organizing data is essential nowadays. Many companies, groups of people, and individuals strive to organize data to get a better and more comprehensive insight. Bearing in mind that many operations are being digitalized, the amount of data generated is becoming vast. A good starting point for organizing and managing data is to group (cluster) it. Clustering is an excellent technique for grouping similar or related objects: "Clustering is the unsupervised classification of patterns (observations, data items, or feature vectors) into groups (clusters)" [78]. There is an abundance of research on clustering, and many algorithms have been proposed in publications on the topic [6] [78] [81] [82].

    Different algorithms and techniques have been proposed for generating clusters of textual

    documents based on their content. The most common clustering techniques are:

    • Hierarchical clustering,

    • Density-based clustering,

    • Partitioning clustering algorithms,

    • Graph-based algorithms,

    • Grid-based algorithms,

    • Model-based clustering algorithms,

    • Combinational clustering algorithms.


    One way to summarize a large amount of data is to use clustering techniques to group data in a

    meaningful way so that the objects inside the groups, or clusters, have the most similarities

    while objects in different groups have the most differences. Two types of clustering algorithms

    are available: nested and partitioned. A nested clustering algorithm creates overlapped

    clusters, while a partitioned clustering algorithm creates non-overlapping clusters. For

    programming research, in which differences between groups of programmers are required, the

    second type of clustering algorithm is more appropriate.

    2.1.1 K-means Clustering

    K-means is a center-based clustering algorithm in which a representative point (object) is chosen for each cluster, and the distance between the representative point and all other points (objects) is computed. Since one chooses the number of clusters, "k", the algorithm is called K-means. This means that k representative objects (i.e., centroids or medoids) are selected. Each object is then assigned to the closest centroid and, therefore, to the related cluster. In the next step, the center point is updated according to the objects that have been assigned to the cluster. For the newly created centroids, the members of the clusters are computed again. This process is repeated until the centroids no longer change and the process converges. In the worst case, the algorithm will converge after at most k·n iterations [72] [85].

    K-means clustering has been widely applied to large amounts of data as one of the most efficient clustering algorithms. However, its performance varies to a great extent with the type of data it is applied to.

    The K-Means algorithm steps are illustrated in the following figure:


    Figure 4. K-Means clustering process

    The time complexity of the K-Means algorithm is O(l·k·m·n), where l is the number of iterations, k is the number of clusters, m is the number of dimensions (attributes), and n is the number of objects.

    Even though the K-Means algorithm is widely used in practice, it has some disadvantages:

    • It is sensitive to initialization,

    • It is sensitive to outliers,

    • It can only handle clusters with a roughly symmetrical point distribution, and

    • The value of k must be defined in advance.

    2.1.2 K-Means++ Clustering

    The proposed K-means++ algorithm follows almost the same approach as K-means, but with a minor modification in the initialization of the centroids. It is considered an improved version of K-means because it is less susceptible (less likely to be influenced) to the initialization problem [26].

    The basic iterative procedure, which K-means++ shares with K-means, is:

    1. Define initial centroids for the clusters.

    2. Assign every data point to its nearest centroid (group).

    3. Set the position of every centroid to the mean value of all data points which belong to its group.

    4. Repeat steps 2 and 3 until the assignments no longer change.
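The iterative procedure above can be sketched in plain Python. This is a minimal 1-D toy of ours for illustration only (function name, data, and the fixed iteration cap are our own choices), not the implementation used in this work:

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain 1-D K-means sketch: returns the final centroids, sorted."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)              # step 1: initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                           # step 2: nearest centroid
            i = min(range(k), key=lambda j: abs(p - centroids[j]))
            clusters[i].append(p)
        new_centroids = [sum(c) / len(c) if c else centroids[i]  # step 3: means
                         for i, c in enumerate(clusters)]
        if new_centroids == centroids:             # step 4: converged
            break
        centroids = new_centroids
    return sorted(centroids)

# Two obvious groups around 1.0 and 10.0.
data = [1.0, 1.2, 0.8, 10.0, 10.2, 9.8]
centers = kmeans(data, k=2)
```

On this toy data the centroids converge to the two group means (approximately 1.0 and 10.0) regardless of which points are drawn as the initial seeds.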


    K-means++ chooses the first centroid uniformly at random. For each data object (point), it then computes the distance from that point to the nearest already-chosen centroid. The remaining centers, up to the number "k" set by the user, are chosen using a weighted probability distribution in which points farther from the existing centroids have a higher probability of being selected as centroids. This modification of K-means arguably solves the initialization problem, which in some cases can profoundly affect the accuracy of the algorithm [72] [85].
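The seeding step can be sketched as follows. This is again a 1-D toy of ours: squared distances to the nearest chosen centroid serve as the weights, and `random.choices` performs the weighted draw.

```python
import random

def kmeans_pp_init(points, k, seed=0):
    """K-means++ seeding sketch (1-D): first centroid uniform at random;
    each further centroid drawn with probability proportional to the
    squared distance to the nearest centroid chosen so far."""
    rng = random.Random(seed)
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min((p - c) ** 2 for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights)[0])
    return centroids

# Two far-apart groups: the second seed almost surely lands in the other group.
data = [0.0, 0.1, 0.2, 50.0, 50.1]
seeds = kmeans_pp_init(data, k=2)
```

Note that an already-chosen point gets weight 0, so the same point can never be drawn twice.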

    2.1.3 K-Medoids Clustering

    The K-Medoids algorithm is similar to K-Means in that both are partitional and break the data set into groups. However, in contrast to K-means, the initialized centers of K-medoids, called medoids, are actual data objects (points) from the dataset. K-medoids is more robust to outliers than K-means and can be used with any distance/similarity measure. However, K-Medoids is also computationally more expensive, so it is more efficient when applied to smaller datasets than to larger ones [86].
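A minimal sketch of why medoids resist outliers (an exhaustive 1-D toy of ours, not the standard PAM swap procedure): the medoid minimizes the total distance to all points, so a single extreme value drags the mean but barely moves the medoid.

```python
def medoid(points):
    """The medoid is the data point with the smallest total distance
    to all other points (exhaustive search, fine for a toy example)."""
    return min(points, key=lambda m: sum(abs(m - p) for p in points))

data = [1.0, 1.1, 0.9, 1.2, 100.0]   # one extreme outlier
mean = sum(data) / len(data)          # dragged far towards the outlier
m = medoid(data)                      # stays inside the dense group
```

Here the mean ends up above 20 while the medoid remains a point from the dense group, which is the robustness property described above.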

    Figure 5. K-Means vs. K-Medoids Algorithms

    Source: Cross Validated – Stack Exchange


    Although K-medoids is more robust to outliers than K-means, it does not always produce satisfactory results when dealing with noisy data and outliers. For this reason, new algorithms have been proposed. One of them is the Distributed K-Means clustering algorithm, another enhanced version of the K-Means algorithm, which relies on normalized datasets to identify the groups of clusters [86].


    Figure 6. K-Medoids process

    The Distributed K-Means algorithm proceeds as follows:

    1. Find the maximum and minimum values of every feature in every local dataset and send them to a central site.

    2. Compute the global maximum and minimum values at the central site.

    3. Normalize the values of the local datasets using the global maximum and minimum values.

    4. Cluster every local dataset with K-means and obtain the centroids, along with a cluster index, for every dataset.

    5. Merge the cluster centroids of the local datasets into a single dataset, named centroids.

    6. Cluster the centroids dataset with K-means to obtain the overall centroids.

    7. After calculating the Euclidean distance between each object and the overall centroids, update the local cluster indices by assigning each object to its nearest cluster centroid.
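Steps 1-3 above (gathering the local extremes and normalizing with the global range) can be sketched as follows; the two "sites" are invented toy data of ours:

```python
def global_min_max(local_datasets):
    """Steps 1-2: gather local minima/maxima and compute the global ones."""
    lo = min(min(d) for d in local_datasets)
    hi = max(max(d) for d in local_datasets)
    return lo, hi

def normalize(dataset, lo, hi):
    """Step 3: scale every value into [0, 1] using the global range."""
    return [(x - lo) / (hi - lo) for x in dataset]

site_a = [2.0, 4.0, 6.0]      # local dataset at one site
site_b = [0.0, 8.0, 10.0]     # local dataset at another site
lo, hi = global_min_max([site_a, site_b])
norm_a = normalize(site_a, lo, hi)
```

Normalizing every site against the same global range is what makes the locally computed centroids comparable when they are later merged (steps 5-6).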


    2.1.4 Hierarchical Clustering

    Hierarchical clustering produces a set of nested clusters organized as a hierarchical tree [86]. An example of a hierarchical clustering is shown in the next figure. A dendrogram can be used to visualize the sequence of merges or splits.

    Figure 7. Hierarchical Clustering Source: http://faculty.juniata.edu/rhodes/ml/hiercluster.htm

    Hierarchical clustering algorithms can be beneficial in specific cases when working with data. In our case, we consider this type of clustering algorithm in our final phase, the recommendation phase. In our proposed model, we will distinguish between different levels of detail for a given cluster. Therefore, we will apply this type of clustering technique after the clusters have been generated.

    Two types of Hierarchical Clustering Algorithms exist:

    • Agglomerative Hierarchical clustering

    • Divisive Hierarchical clustering

    The Agglomerative Hierarchical clustering is a bottom-up approach, and the algorithm is as

    follows:


    Figure 8. The Agglomerative Hierarchical clustering is a bottom-up approach (left);

    the Divisive Hierarchical clustering is a top-down approach (right)

    In practice, top-down clustering algorithms are more accurate compared to the bottom-up

    approach.

    The advantages of Hierarchical Clustering are as follows:

    • No assumptions on the number of clusters: any desired number of clusters can be obtained by ‘cutting’ the dendrogram at the proper level.

    • Hierarchical clustering may correspond to meaningful taxonomies.

    There are two disadvantages of Hierarchical clustering:

    • It is sensitive to noisy data and outliers.

    • Its time complexity is O(n²), where n is the number of textual documents; if the dataset is big, the algorithm will perform too slowly.
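The bottom-up (agglomerative) procedure can be sketched in a few lines. This is a single-linkage toy of ours on 1-D points, chosen for brevity; a real implementation would maintain a distance matrix with efficient updates rather than recomputing distances:

```python
def agglomerative(points, target_k):
    """Bottom-up clustering sketch: start with singleton clusters and
    repeatedly merge the two closest ones (single linkage) until
    `target_k` clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: distance between closest members
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)   # merge the closest pair
    return [sorted(c) for c in clusters]

groups = agglomerative([1.0, 1.2, 9.0, 9.3], target_k=2)
```

Stopping the merging at different values of `target_k` corresponds to "cutting" the dendrogram at different levels, which is the property our recommendation phase exploits.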

    2.1.5 Text representation formats

    When implementing a Bag of Words model, insignificant words are filtered out from the

    training data mainly because they hold no meaning and increase the dimensionality of data.

    When text is represented as a “bag-of-words,” a high number of dimensions is generated.

    [Content of Figure 8 — Agglomerative: 1) compute the distance matrix between the input data points; 2) let each data point be a cluster; 3) repeat: merge the two closest clusters and update the distance matrix, until only a single cluster remains. Divisive: 1) start at the top with all patterns in one cluster; 2) split the cluster using a flat clustering algorithm; 3) apply this procedure recursively.]


    Hence, it is desirable to apply techniques for reducing the dimensionality of data but still

    retaining as much information as possible. There are several adequate techniques for reducing

    the dimensionality of data. For instance, some well-known techniques which only take the most

    important words are [2] [11] [12]:

    • TF-IDF

    • PCA

    • LDA

    • SVD

    It is worth noting that there is no single best technique for dimensionality reduction. All the above techniques are widely used and should be chosen depending on the experimental setup. Additionally, to support the simplifying power of a Bag of Words representation, it is highly recommended to apply Natural Language Processing methods such as tokenization, relation detection, entity detection, and segmentation.
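A minimal sketch of tokenization, stop-word filtering, and a Bag of Words count follows; the stop-word list below is a tiny invented sample of ours, whereas real pipelines use far larger lists:

```python
from collections import Counter

# Tiny illustrative stop-word list (real lists contain hundreds of words).
STOP_WORDS = {"the", "a", "of", "and", "is"}

def bag_of_words(text):
    """Tokenize, drop stop words, and count the remaining term frequencies."""
    tokens = text.lower().split()
    return Counter(t for t in tokens if t not in STOP_WORDS)

bow = bag_of_words("The clustering of the documents and the clustering quality")
```

Dropping the stop words both removes meaningless features and shrinks the dimensionality of the resulting term space, which is exactly the motivation stated above.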

    On the other hand, Word2Vec, proposed by [12], represents words in a multidimensional space by vectorizing them. Word2vec belongs to a group of related models used to produce word embeddings. These models are shallow, two-layer neural networks trained to model the linguistic contexts of words. The input fed to a Word2vec model is a large corpus of text, while the generated output is a vector space, typically of several hundred dimensions, in which each unique word in the corpus is assigned a corresponding vector.

    One of the main advantages of Word2Vec is that it exposes interesting linguistic regularities between words in a large text corpus. Simply put, it turns text into numerical form and makes the distances between the vector representations of words meaningful. Recently, Word2Vec has been used in various natural language processing, clustering, and classification applications [2] [12] [65] [66].

    The efficiency and effectiveness of many classification algorithms increasingly rely on the power of Word2Vec to extract valuable information from text [65].


    Word sequences, graph structures, and word embeddings (i.e., word2vec) are the most recent

    models proposed by researchers, which have shown promising results in various applications by

    using various datasets. The most recent model developments have shown considerable potential for more advanced machine-learning applications on textual datasets.

    Word2vec was created by a team of researchers led by Tomas Mikolov at Google. The algorithm has since been analyzed and explained by other researchers [2] [3]. Embedding vectors created using the Word2vec algorithm have many advantages compared to earlier algorithms [1], like Latent Semantic Analysis.

    Moreover, two additional relevant techniques in natural language processing are Latent Semantic Analysis (LSA) and LSTM networks. LSA is a natural language processing technique, in particular one of distributional semantics, for analyzing the relationships between the terms that occur in a group of documents [2]. Its purpose is to produce a set of relevant concepts related to the documents and terms. Feature-learning and language-modeling techniques in natural language processing are also known as word-embedding techniques: words or phrases from a vocabulary are mapped to vectors of real numbers.

    LSTM neural networks are considered state-of-the-art approaches, offering very high accuracy

    results in several Natural Language Processing tasks such as Bi-directional LSTM-CRF [18] for

    Part of Speech Tagging and Tree-LSTMs for sentiment analysis [19]. Also, simpler versions of

    LSTMs, referred to as Gated Recurrent Units (GRUs) [16], are used as crucial parts of larger systems, such as state-of-the-art Dynamic Memory Networks, which address more complicated tasks such as question-answering systems [17].

    2.2 Supervised Learning

    In machine learning, the classification task is also known as supervised learning because it uses a set of labeled training data to learn (generate a model). Then, based on the learned model, it classifies new instances. Each classification task consists of two phases: a training step and a


    testing step [9]. The training data is used to apply “learning algorithms” (classification

    algorithms). On the other hand, the testing data is used to test the accuracy of the classifiers.
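The two phases can be sketched with a tiny from-scratch classifier. The following multinomial Naïve Bayes with Laplace smoothing is an illustrative sketch of ours (the class name, toy documents, and labels are invented), not the exact system evaluated in this thesis:

```python
import math
from collections import Counter, defaultdict

class TinyNaiveBayes:
    """Minimal multinomial Naive Bayes for tokenized documents."""

    def fit(self, documents, labels):
        # Training step: count words per class and class frequencies.
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for doc, label in zip(documents, labels):
            self.word_counts[label].update(doc)
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, doc):
        # Testing step: pick the class with the highest log-probability.
        best_label, best_score = None, float("-inf")
        total_docs = sum(self.class_counts.values())
        for label in self.class_counts:
            score = math.log(self.class_counts[label] / total_docs)  # prior
            total_words = sum(self.word_counts[label].values())
            for word in doc:  # likelihood with Laplace (add-one) smoothing
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Training step on labeled data, then a testing step on an unseen document.
train_docs = [["free", "offer", "buy"], ["meeting", "agenda", "notes"],
              ["free", "prize", "win"], ["project", "meeting", "report"]]
train_labels = ["spam", "ham", "spam", "ham"]
model = TinyNaiveBayes().fit(train_docs, train_labels)
prediction = model.predict(["free", "win", "offer"])
```

In a real evaluation, the labeled data would be split into disjoint training and testing sets, and accuracy would be measured over the whole testing set rather than a single document.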

    Figure 9. Supervised Learning process

    Classifying text is an important topic nowadays. Some application fields of text mining and text

    classification are:

    Classify Reviews: This is a typical application for e-commerce Web sites. Almost all products have a section where users write their reviews. The company needs to read and evaluate all the reviews and classify them as positive or negative. However, in most cases, it is impossible to go through all the reviews manually. Therefore, a solution is to train a classifier that will automatically differentiate positive from negative reviews.

    Spam Detection: A trending topic in the last decade. Everyone receives e-mails containing spam. For mail servers, it is imperative to create systems that can quickly detect spam e-mails. Using classification techniques, models can be generated that classify e-mails as spam or not spam.

    Author Detection: Every author has his or her own style of writing, and this is another topic where text classification is applied. However, author detection is a harder problem: in many cases, the writing styles of two authors are very similar, so high accuracy cannot be reached on this task.

    Language Detection: A prevalent task nowadays; Google uses it in its translation engine. This type of problem achieves the highest accuracy because languages are very different from one another. For that reason, in most cases, this task is not even treated as a classification task.


    Classify News Articles: This is the topic demonstrated in this thesis. An advantage of working on this topic is that many news articles can be found on the Web with which to experiment and test different algorithms. It is a prevalent task in which researchers try to improve the accuracy of classification algorithms. A drawback of this task is that new words appear very often in news articles.