
ICON 2020

17th International Conference on Natural Language Processing

Proceedings of the TechDOfication 2020 Shared Task

December 18 - 21, 2020
Indian Institute of Technology Patna, India


©2020 NLP Association of India (NLPAI)


Introduction

These shared task proceedings concluded the shared task on Technical DOmain Identification, named as TechDOfication 2020, launched on 7th October 2020. The shared task was collocated with the 17th International Conference on Natural Language Processing (ICON 2020), held at IIT-Patna, India. The goal of the shared task was to automatically identify the technical domain of a given text in a specified language. The languages included in this task were English, Bangla, Gujarati, Hindi, Malayalam, Marathi, Tamil, and Telugu.

Two subtasks were part of this shared task. The first subtask was to identify a coarse-grained technical domain like Computer Science, Communication Technology, Management, Math, Physics, Life Science, Law etc. The second subtask involved the identification of fine-grained subdomains for the Computer Science domain, like Operating System, Computer Network, Database etc. The first subtask was conducted in all the mentioned languages while the second subtask only involved English.

We received nine system submissions and system description papers. Each system description paper was reviewed by two members of the reviewing committee – all papers were accepted. Macro F1 scores were used to evaluate the systems.

Both Machine Learning and Neural Network methods were used by different teams in this shared task. Support Vector Machines and Voting Classifiers were the predominantly used Machine Learning models. Different neural architectures like Multi-Layer Perceptron, BiLSTM, BERT-based models, Graph Convolutional Neural Networks, XLM-RoBERTa, and Multichannel LSTM-CNN were also used. We would like to thank the ICON-2020 organizers, the shared task participants, the authors, and the reviewers for making this shared task successful.

Shared task page: http://ssmt.iiit.ac.in/techdofication
Main conference page: https://www.iitp.ac.in/~ai-nlp-ml/icon2020/index.html


Organizing Committee:

Dipti Misra Sharma (IIIT-Hyderabad)
Asif Ekbal (IIT-Patna)
Karunesh Arora (C-DAC, Noida)
Sudip Kumar Naskar (Jadavpur University)
Dipankar Ganguly (C-DAC, Noida)
Sobha L (AUKBC-Chennai)
Radhika Mamidi (IIIT-Hyderabad)
Sunita Arora (C-DAC, Noida)
Pruthwik Mishra (IIIT-Hyderabad)
Vandan Mujadia (IIIT-Hyderabad)


Table of Contents

MUCS@TechDOfication using FineTuned Vectors and n-grams
    Fazlourrahman Balouchzahi, M D Anusha and H L Shashirekha . . . . . . . . . . . . . . . . 1

A Graph Convolution Network-based System for Technical Domain Identification
    Alapan Kuila, Ayan Das and Sudeshna Sarkar . . . . . . . . . . . . . . . . 6

Multichannel LSTM-CNN for Telugu Text Classification
    Sunil Gundapu and Radhika Mamidi . . . . . . . . . . . . . . . . 11

Multilingual Pre-Trained Transformers and Convolutional NN Classification Models for Technical Domain Identification
    Suman Dowlagar and Radhika Mamidi . . . . . . . . . . . . . . . . 16

Technical Domain Identification using word2vec and BiLSTM
    Koyel Ghosh, Dr. Apurbalal Senapati and Dr. Ranjan Maity . . . . . . . . . . . . . . . . 21

Automatic Technical Domain Identification
    Hema Ala and Dipti Sharma . . . . . . . . . . . . . . . . 27

Fine-grained domain classification using Transformers
    Akshat Gahoi, Akshat Chhajer and Dipti Mishra Sharma . . . . . . . . . . . . . . . . 31

TechTexC: Classification of Technical Texts using Convolution and Bidirectional Long Short Term Memory Network
    Omar Sharif, Eftekhar Hossain and Mohammed Moshiul Hoque . . . . . . . . . . . . . . . . 35

An Attention Ensemble Approach for Efficient Text Classification of Indian Languages
    Atharva Kulkarni, Amey Hengle and Rutuja Udyawar . . . . . . . . . . . . . . . . 40


Shared Task Program

Monday, December 21, 2020

+ 14:00 - 14:30 Talk by Sobha L, AUKBC-Chennai

+ 14:30 - 14:45 Shared Task Overview

Presentations

16:45 - 16:55   Automatic Technical Domain Identification
                Hema Ala and Dipti Sharma

16:58 - 17:08   Technical Domain Identification using word2vec and BiLSTM
                Koyel Ghosh, Dr. Apurbalal Senapati and Dr. Ranjan Maity

17:11 - 17:21   A Graph Convolution Network-based System for Technical Domain Identification
                Alapan Kuila, Ayan Das and Sudeshna Sarkar

17:24 - 17:34   Fine-grained domain classification using Transformers
                Akshat Gahoi, Akshat Chhajer and Dipti Mishra Sharma

17:37 - 17:47   MUCS@TechDOfication using FineTuned Vectors and n-grams
                Fazlourrahman Balouchzahi, M D Anusha and H L Shashirekha

17:50 - 18:00   TechTexC: Classification of Technical Texts using Convolution and Bidirectional Long Short Term Memory Network
                Omar Sharif, Eftekhar Hossain and Mohammed Moshiul Hoque

18:03 - 18:13   An Attention Ensemble Approach for Efficient Text Classification of Indian Languages
                Atharva Kulkarni, Amey Hengle and Rutuja Udyawar

18:16 - 18:26   Multilingual Pre-Trained Transformers and Convolutional NN Classification Models for Technical Domain Identification
                Suman Dowlagar and Radhika Mamidi

18:29 - 18:39   Multichannel LSTM-CNN for Telugu Text Classification
                Sunil Gundapu and Radhika Mamidi


Proceedings of the 17th International Conference on Natural Language Processing: TechDOfication 2020 Shared Task, pages 1–5, Patna, India, December 18 - 21, 2020. ©2020 NLP Association of India (NLPAI)

MUCS@TechDOfication using FineTuned Vectors and N-grams

F Balouchzahi
Dept. of Computer Science
Mangalore University
Mangalore - 574199
India
frs [email protected]

M D Anusha
Dept. of Computer Science
Mangalore University
Mangalore - 574199
[email protected]

H L Shashirekha
Dept. of Computer Science
Mangalore University
Mangalore - 574199
[email protected]

Abstract

The increase in domain-specific text processing applications is creating a demand for tools and techniques for domain-specific Text Classification (TC), which may be helpful in many downstream applications like Machine Translation, Summarization, Question Answering etc. Further, many TC algorithms are applied to globally recognized languages like English, giving less importance to local languages, particularly Indian languages. To boost research on technical domains and text processing activities in Indian languages, a shared task named "TechDOfication2020" is organized by ICON'20. The objective of this shared task is to automatically identify the technical domain of a given text, which provides information about coarse-grained technical domains and fine-grained subdomains, in eight languages. To tackle this challenge we, team MUCS, have proposed three models, namely, a DL-FineTuned model applied for all subtasks, and VC-FineTuned and VC-ngrams models applied only for some subtasks. n-grams and word embeddings with a fine-tuning step are used as features, and machine learning and deep learning algorithms are used as classifiers in the proposed models. The proposed models performed well in most of the subtasks and obtained first rank in subTask1b (Bangla) and subTask1e (Malayalam), with F1 scores of 0.8353 and 0.3851 respectively, using the DL-FineTuned model for both subtasks.

1 Introduction

TC is one of the important areas of research in Natural Language Processing (NLP). Most of the TC experiments being conducted are on resource-rich languages such as English, Spanish etc., giving less importance to resource-poor languages, particularly Indian languages. Further, with the increase in domain-specific text processing applications, domain-specific TC is gaining importance, demanding specialized tools and techniques to handle domain-specific text datasets (Sun et al., 2019) for many downstream applications like Machine Translation, Summarization, Question Answering etc. In this direction, a shared task named "TechDOfication2020: Technical Domain Identification" is organized in association with the 17th International Conference on Natural Language Processing1 (ICON) 2020 to automatically identify the technical domain of a given text in eight languages, namely, English, Bangla, Gujarati, Hindi, Malayalam, Marathi, Tamil, and Telugu. These texts provide information about specific coarse-grained technical domains and fine-grained subdomains. Details of the shared tasks are provided on the task website2. Technical domain identification can be modeled as a domain-specific TC task. In this paper, we describe our models, namely, VC-ngrams, VC-FineTuned and DL-FineTuned, proposed by our team MUCS for technical domain identification in eight languages using n-grams and fine-tuned vectors as features.

n-gram features, which are simple and scalable, are utilized in many NLP tasks. With a bigger value of 'n', a model can store more context, with a well-understood space-time tradeoff enabling many TC experiments to scale up efficiently. Word embeddings, or vector representations of words, capture grammatical and semantic information of a word, which can be an enlightening feature for many NLP applications. In this study, we investigate two popular word vector models for learning word vectors, namely GloVe3 for English (subTask1a and subTask2a) and Bangla (subTask1b), and fastText4 for the rest of the subtasks (pre-trained GloVe vectors were not available for the other languages). The GloVe model produces a vector space with meaningful substructure, as evidenced by its performance on a recent word analogy task5, and it forces the model to encode the frequency distribution of words that occur near them in a more global context. fastText, an extension of the Word2Vec model, is a word embedding method that helps capture the meaning of shorter words. It allows the embeddings to understand suffixes and prefixes and works well with rare words, so even if a word is not seen during training, it can be broken down into n-grams to obtain its embedding. Pre-trained word embeddings are rarely available for domain-specific text, and even less frequently for resource-poor languages such as Persian and Indian languages. Hence, fine-tuning a word embedding using specific (technical) domain texts can improve the performance of TC, as vectors for words missing in the pre-trained model will be updated by the domain-specific texts (Liao et al., 2010) given for training.

1 http://www.iitp.ac.in/~ai-nlp-ml/icon2020/
2 https://ssmt.iiit.ac.in/techdofication.html
3 https://nlp.stanford.edu/projects/glove
4 https://fasttext.cc
5 https://dzone.com/articles/glove-and-fasttext-two-popular-word-vector-models
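To make the fastText subword behaviour concrete, here is a minimal illustrative sketch (not the authors' code) that assumes the official `fasttext` Python package and a pre-trained model file downloaded from fasttext.cc; the model path and the query word are placeholders.

```python
# Illustrative sketch: fastText builds a vector for an out-of-vocabulary word
# from its character n-grams. Assumes the `fasttext` package and a downloaded
# pre-trained model file (placeholder path below).
import fasttext

model = fasttext.load_model("cc.te.300.bin")  # e.g. pre-trained Telugu vectors

# A word unseen during pre-training still receives a 300-dimensional vector,
# obtained by summing the embeddings of its character n-grams.
vec = model.get_word_vector("hypothetical_technical_term")
print(vec.shape)  # (300,)
```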

Several Machine Learning (ML) and Deep Learning (DL) models provide effective and accurate results for TC by reducing false positives (Bhargava et al., 2016). Many DL models use Bidirectional Long Short Term Memory (BiLSTM), which contains two LSTMs: one processing the input in the forward direction and the other in the backward direction. BiLSTMs are at the core of several neural models achieving state-of-the-art performance in a wide variety of NLP tasks (Bhargava et al., 2016). The BiLSTM model can use the pre-trained word embeddings provided by fastText and GloVe.

The rest of the paper is organized as follows. Section 2 highlights the related work, followed by the proposed methodology in Section 3. Experiments and results are described in Section 4, and the paper finally concludes in Section 5.

2 Related Work

Researchers around the globe have developed several approaches for domain-specific TC. Some of the related ones are described below:

A strategy to develop sentence vectors (sent2vec) by averaging word embeddings is proposed by (Liu, 2017) to explore the effect of word2vec on the performance of sentiment analysis of citations. The authors trained the ACL-Embeddings (300 and 100 dimensions) from the ACL collection and also examined polarity-specific word embeddings (PS-Embeddings) for characterizing positive and negative references. The generated embedding is fed to an SVM classifier, and using 10-fold cross-validation they obtained a macro-F1 score of 0.85 and a weighted-F1 score of 0.86, proving the efficiency of word2vec in classifying positive and negative citations.

A domain-specific intent classification system for the Sinhala language proposed by (Buddhika et al., 2018) utilized a feed-forward neural network with back propagation. They trained their model on a banking-domain data set with Mel Frequency Cepstral Coefficients extracted from a Sinhala speech corpus of 10 hours and obtained 74% accuracy in identifying speech queries. (Zhou et al., 2016) proposed a combination of two models, BLSTM-2D Pooling and BLSTM-2D CNN, and tested it on six TC tasks, including sentiment analysis, question classification, subjectivity classification, and newsgroups classification, to compare their models with the state-of-the-art models. The authors evaluated the performance of the proposed models on different lengths of sentences and then conducted a sensitivity analysis of 2D filter and max-pooling size. The BLSTM-2DCNN model achieved excellent performance on 4 out of 6 tasks and obtained 52.4% and 89.5% test accuracies on the Stanford Sentiment Treebank-1 and Treebank-2 datasets respectively.

(Rabbimov and Kobilov, 2020) explored a multi-class TC task to classify Uzbek language text. They collected articles on ten classes taken from the Uzbek "Daryo" online news and used six diverse ML algorithms, namely, Multinomial Naïve Bayes (MNB), Decision Tree Classifier (DTC), SVM, Random Forest (RF) and LR. Hyperparameters for the classifiers were obtained by grid search with 5-fold cross-validation. The results obtained illustrate that, in many cases, character n-grams give better results than word-level n-grams. However, among all, SVM with the Radial Basis Function kernel (RBF SVM) using TF-IDF and character-level four-gram features obtained the best results with an accuracy of 86.88%.

3 Methodology

Inspired by (Liu, 2017) to use word embeddings as features for classification, (Zhou et al., 2016) to use BiLSTMs for classification, and (Rabbimov and Kobilov, 2020) for using ML algorithms for classification, we proposed three models, namely,


Voting Classifiers using n-grams (VC-ngrams), Voting Classifiers using FineTuned word vectors (VC-FineTuned) and Deep Learning using FineTuned word vectors (DL-FineTuned). Each model is built in two steps, namely, i) Feature Extraction and ii) Model Construction, as explained below:

3.1 Feature Extraction

n-gram and word embedding features are utilized in the proposed models. Linguistic models have proven their efficiency in many studies. Hence, n-gram features including character n-grams (1, 2, 3, 4, 5) along with word n-grams (1, 2, 3) are extracted from the input data, and the CountVectorizer6 library is used to generate n-gram count vectors, which are used in the VC-ngrams model.

The pre-trained GloVe word embeddings with a vector size of 300 are used for the English and Bangla subtasks, and fastText embeddings of size 300 for the subtasks of the rest of the languages, as features after fine-tuning using the training data. Fine-tuning the vectors of a pre-trained model with the training data of a specific task helps generate vectors for words missing in pre-trained models, and it can lead to higher performance, specifically in fine-grained tasks such as domain-specific TC. The fine-tuned vectors are utilized to build the embedding matrix used to train the VC-FineTuned and DL-FineTuned models.
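As an illustration of the n-gram features described in Section 3.1, the following is a minimal sketch using scikit-learn; combining the character and word vectorizers with a FeatureUnion is our own illustrative choice, not necessarily the authors' exact implementation.

```python
# Sketch: character (1-5) and word (1-3) n-gram count features, concatenated
# into one sparse matrix, as used by the VC-ngrams model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

ngram_features = FeatureUnion([
    ("char_ngrams", CountVectorizer(analyzer="char", ngram_range=(1, 5))),
    ("word_ngrams", CountVectorizer(analyzer="word", ngram_range=(1, 3))),
])

train_texts = ["an example technical sentence", "another training sentence"]
X_train = ngram_features.fit_transform(train_texts)  # documents x n-gram counts
```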

3.2 Model Construction

Construction of the three models is explained below:

• VC-ngrams model: An ensemble Voting Classifier with three ML classifiers, namely, Multilayer Perceptron (MLP), Logistic Regression (LR), and Support Vector Machine (SVM), has been trained on the n-gram count vectors obtained in the Feature Extraction step. Default parameters are used for the SVM and LR classifiers; for the MLP, the hidden layer sizes are set to (150, 100, 50) and the maximum iterations, activation, solver, and random state have been set to 300, ReLU, Adam and 1 respectively. The structure of the VC-ngrams model is shown in Figure 1.

• DL-FineTuned model: This model is created based on a DL architecture using pre-trained word embeddings. A pre-trained word embedding of size 300 is fine-tuned using the specific language training set, and the obtained vectors are used to build the embedding matrix. This embedding matrix is used as input to a Sequential model from the Keras7 library to build a BiLSTM network of size 100 with the activation and optimizer parameters set to "softmax" and "adam" respectively. The output dimensions are configured based on the corresponding subtask's labels (e.g. seven labels for subTask2a) as given by the organizers. The constructed model has been trained for a dynamic number of epochs until the loss value stabilized (not more than 20 epochs). The structure of the DL-FineTuned model is shown in Figure 2 (a minimal code sketch is also given at the end of this section).

Figure 1: Structure of VC-ngrams model

6 https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html

• VC-FineTuned model: The architecture of the VC-FineTuned model is derived by merging the feature extraction part of the DL-FineTuned model and the model construction part of the VC-ngrams model. In this model, a pre-trained word embedding of size 300 is fine-tuned using the specific language training set, and the resulting vectors are used to build an ensemble Voting Classifier with the same estimators as the VC-ngrams model, namely, MLP, SVM, and LR.

The DL-FineTuned model is applied for all the subtasks, whereas, due to the lack of pre-trained models and the long time taken for training the models, the VC-ngrams model is applied only for the English and Gujarati subTask1 and the English subTask2, and the VC-FineTuned model is applied only for the English, Gujarati and Malayalam subTask1.
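A minimal sketch of the DL-FineTuned classifier described above, assuming tf.keras; the vocabulary size, number of labels and the (zero-filled) embedding matrix are placeholders that would come from the tokenized training data and the fine-tuned vectors.

```python
# Sketch of the DL-FineTuned model: fine-tuned 300-d embeddings, a BiLSTM of
# size 100 and a softmax output over the subtask's labels.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, Bidirectional, LSTM, Dense
from tensorflow.keras.initializers import Constant

vocab_size, embed_dim, num_labels = 20000, 300, 7     # placeholders
embedding_matrix = np.zeros((vocab_size, embed_dim))  # filled from fine-tuned vectors

model = Sequential([
    Embedding(vocab_size, embed_dim,
              embeddings_initializer=Constant(embedding_matrix), trainable=True),
    Bidirectional(LSTM(100)),
    Dense(num_labels, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(X_train, y_train, epochs=20, validation_data=(X_dev, y_dev))
```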

7 https://keras.io/guides/sequential_model/


Figure 2: Structure of DL-FineTuned model

4 Experimental Results

4.1 Datasets

The datasets provided by TechDOfication 2020 consist of train, development, and test sets for nine subtasks in eight languages, namely, English, Bangla, Gujarati, Hindi, Malayalam, Marathi, Tamil, and Telugu, and the details of the datasets are given in Table 1. Only the training datasets for the English subtasks are balanced; the subtasks of all other languages are not balanced. Further, subTask1e (Malayalam) has only three classes, and subTask1d (Hindi) and subTask2a (English) have seven classes each. More details about the datasets are available on the task website8 and also in the Github repository9.

4.2 Results

The number of participating teams and submitted runs given in Table 2 illustrates that more teams registered for the English subtasks, and the number of runs submitted for the English subtasks is also higher compared to the other subtasks. The performance of the models for the subtasks on the development set in terms of F1 score is shown in Table 3.

While the DL-FineTuned model is applied for all the subtasks, the VC-ngrams and VC-FineTuned models are applied only for some subtasks. It can be observed that the DL-FineTuned model performs better for some subtasks and the VC-ngrams model performs better for some others. Also, the proposed DL-FineTuned model obtained first position in subTask1b (Bangla) and subTask1e (Malayalam) with F1 scores of 0.8353 and 0.3851 respectively. The results illustrate that fine-tuning the vectors with specific domain text for the subtasks using the DL-FineTuned model performed better compared to the other two models. However, the VC-ngrams model applied on three subtasks has also performed well.

8 https://ssmt.iiit.ac.in/techdofication.html
9 https://github.com/fazlfrs/TechDofication2020

5 Conclusion and Future Work

We, team MUCS, proposed three models, namely a DL-FineTuned model applied for all subtasks, and VC-FineTuned and VC-ngrams models applied for only a few subtasks, to identify the technical domain of a given text in Indian languages. The results reported by the TechDOfication 2020 task organizers illustrate that the proposed DL-FineTuned model performed well in most of the subtasks and also obtained first rank in subTask 1b (Bangla) and subTask 1e (Malayalam) with F1 scores of 0.8353 and 0.3851 respectively. We would like to improve our proposed models by applying other features and also other learning approaches such as Transfer Learning.

References

Rupal Bhargava, Yashvardhan Sharma, and Shubham Sharma. 2016. Sentiment analysis for mixed script indic sentences. In 2016 International Conference on Advances in Computing, Communications and Informatics (ICACCI), pages 524–529. IEEE.

Darshana Buddhika, Ranula Liyadipita, Sudeepa Nadeeshan, Hasini Witharana, Sanath Javasena, and Uthayasanker Thayasivam. 2018. Domain specific intent classification of Sinhala speech data. In 2018 International Conference on Asian Language Processing (IALP), pages 197–202. IEEE.

Chunyuan Liao, Hao Tang, Qiong Liu, Patrick Chiu, and Francine Chen. 2010. FACT: Fine-grained cross-media interaction with documents via a portable hybrid paper-laptop interface. In Proceedings of the 18th ACM International Conference on Multimedia, pages 361–370.

Haixia Liu. 2017. Sentiment analysis of citations using word2vec. arXiv preprint arXiv:1704.00177.

I. M. Rabbimov and S. S. Kobilov. 2020. Multi-class text classification of Uzbek news articles using machine learning. In Journal of Physics: Conference Series, volume 1546, page 012097.

Chi Sun, Xipeng Qiu, Yige Xu, and Xuanjing Huang. 2019. How to fine-tune BERT for text classification? In China National Conference on Chinese Computational Linguistics, pages 194–206. Springer.

Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu. 2016. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. arXiv preprint arXiv:1611.06639.


Proceedings of the 17th International Conference on Natural Language Processing: TechDOfication 2020 Shared Task, pages 6–10, Patna, India, December 18 - 21, 2020. ©2020 NLP Association of India (NLPAI)

A Graph Convolution Network-based System for Technical Domain Identification

Alapan Kuila, Ayan Das and Sudeshna Sarkar
Department of Computer Science and Engineering
Indian Institute of Technology Kharagpur,
Kharagpur, WB, India-721302
{alapan.kuila, ayan.das, sudeshna}@cse.iitkgp.ac.in

Abstract

This paper presents the IITKGP contribution at the Technical DOmain Identification (TechDOfication) shared task at ICON 2020. In the preprocessing stage, we applied part-of-speech (PoS) taggers and dependency parsers to tag the data. We trained a graph convolution neural network (GCNN) based system that uses the tokens along with their PoS and dependency relations as features to identify the domain of a given document. We participated in the subtasks for coarse-grained domain classification in the English (Subtask 1a), Bengali (Subtask 1b) and Hindi (Subtask 1d) languages, and the subtask for fine-grained domain classification within the Computer Science domain in the English language (Subtask 2a).

1 Introduction

Text classification is the task of assigning a category to a given piece of text based on its content, from a predefined set of categories (Aggarwal and Zhai, 2012; Kowsari et al., 2019). It is a fundamental natural language processing (NLP) task with several downstream applications such as sentiment analysis, topic labeling and machine translation.

The ICON 2020 shared task on technical domain identification is essentially a text classification problem which involves identifying the technical domain, from a fixed set of domains, to which a document belongs. In this task, the participants are expected to develop a system that automatically identifies the technical domain of a given text (a small passage) in a specified language.

We participated in the following subtasks:

1. Subtask 1a: Coarse-grained domain classification in English.

2. Subtask 1b: Coarse-grained domain classification in Bengali.

3. Subtask 1d: Coarse-grained domain classification in Hindi.

4. Subtask 2a: Fine-grained domain classification within the Computer Science domain in English.

We applied PoS tagging and dependency parsing on the training and test data using PoS taggers and dependency parsers in the corresponding languages. We have used the PoS tags and the dependency relations of the tokens with their parent nodes in the dependency trees as features of the tokens in our system.

In the domain identification stage, the technical domain of a document is predicted from the representation of the document obtained using a GCNN, which takes the contextual representations of the constituent tokens of the document and a graph representation of the document as input. The contextual representations of the tokens are obtained using a multi-layer bi-directional long short-term memory (LSTM) network (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997) that takes the representations of the tokens as input.

Corresponding to each subtask, we submitted a single run of our system.

2 Our Proposed Model

In this section, we discuss our proposed technical domain identification system in detail. The steps for predicting the domain of a given document are as follows.

S1. We partitioned the document into constituent sentences and tokenized each sentence into its constituent words.

S2. We PoS tagged and dependency parsed each sentence.


S3. We obtained the contextual representation of each token in the document by applying a multi-layer bi-directional LSTM on the representations of the tokens, where we considered the entire document as a single sequence.

S4. We used the contextual representations of the tokens obtained in the previous step and the graph representation of the document as input to a GCNN to obtain the final feature representation of each word.

S5. We combined the final feature representations of the tokens to derive the document representation.

S6. We passed the document representation as input to a multi-layer perceptron (MLP) followed by a softmax layer to predict the domain of the document.

2.1 Partitioning of Document into Sentences and Tokenization

We derived the sentences from the documents by partitioning each document on the following characters: ".", "?" and "!". The sentences were tokenized based on spaces. The tokens are essentially the space-separated words in the sentences.
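A small sketch of this splitting and tokenization step (our own illustration of the description above):

```python
# Split a document into sentences on ".", "?" and "!", then tokenize on spaces.
import re

def split_and_tokenize(document):
    sentences = [s.strip() for s in re.split(r"[.?!]", document) if s.strip()]
    return [sentence.split() for sentence in sentences]

doc = "so grad f at r t naught is a vector. why is that? it depends!"
print(split_and_tokenize(doc))
# [['so', 'grad', 'f', 'at', 'r', 't', 'naught', 'is', 'a', 'vector'],
#  ['why', 'is', 'that'], ['it', 'depends']]
```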

2.2 PoS Tagging and Dependency Parsing of Sentences

Before training or testing, we PoS tagged and dependency parsed the sentences. We PoS tagged and parsed the English sentences in subtasks 1a and 2a using the SpaCy library (Honnibal and Johnson, 2015; Honnibal and Montani, 2017). The Bangla and Hindi sentences for subtasks 1b and 1d respectively were PoS tagged and dependency parsed using an LSTM based sequence tagger (Qi et al., 2018) and a bi-affine graph based dependency parser (Dozat et al., 2017). The Bangla tagger and parser were trained using a Bangla treebank developed in our institute. The Hindi tagger and parser were trained using the Universal Dependencies v2.0 Hindi treebank (Nivre et al., 2016).

2.3 Generation of Contextual Representation of Tokens

We obtained the contextual representation of the tokens in a document by passing the distributed representations of the tokens in the document as input to a 3-layer bi-directional LSTM. Here we treated the entire document as a sequence. Each token is represented as the concatenation of the embedding of the token, the embedding of its PoS tag, the embedding of its dependency relation with its parent in the dependency tree of the sentence, and the character-level representation of the word obtained by applying a convolutional neural network (Goodfellow et al., 2016) on the embeddings of the constituent characters of the token.

All the embeddings were randomly initialized and updated during the training phase.

2.4 Graph Convolution Neural Network to Generate the Document Representation

In this section, we discuss the implementation of the GCNN (Kipf and Welling, 2016) used in our system.

2.4.1 Input Layer

The input to the GCNN comprises the contextual representations of the tokens in the document and the directed graph representation of the document in the form of an adjacency matrix.

2.4.2 Graph Construction

The number of nodes |V| in the document graph is the total number of tokens contained in the document, and the graph edges are represented by various intra- and inter-sentence dependency edges, which are discussed below.

• Syntactic dependency edges: The edge set comprises the edges between all token pairs that are linked by head-dependent relations in the dependency tree of the corresponding sentence. To ensure that information flows in both forward and backward directions along syntactic dependency arcs, we included both forward and backward edges in the graph edge set. More precisely, if there is an arc from a token ti to a token tj in the dependency tree, then A(i, j) = 1 and A(j, i) = 1.

• Adjacent sentence edges: In order to keep sequential information flowing through consecutive sentences, we also connected the root nodes of neighbouring sentences. Therefore, Aadj(i, j) = 1 and Aadj(j, i) = 1 if the tokens ti and tj are the root nodes of consecutive sentences.

• Self-node edges: We also include self-node edges in the graph. Therefore, for all tokens ti in the document, A(i, i) = 1.


2.4.3 GCNN Layer

GCNN is an advanced version of CNN operating on graphs that induces node features based on the properties of their neighbouring nodes. A GCNN with one layer of convolution can capture information of only immediate neighbours. When multiple GCNN layers are stacked, information from larger neighbourhoods is accumulated. Let A be the adjacency matrix of the text-graph. A contains the edges as stated in Section 2.4.2, and the adjacency matrix is of dimension $|V| \times |V|$, where $|V|$ is the number of tokens in the document. After stacking k GCNN layers we get the adjacency matrix for the k-th-order text graph $A^k$, where $A^k = (A)^k$. $A^k$ holds all the k-hop paths in the text-graph. The $i$-th word representation of the $(k+1)$-layer text-graph ($A^k$) is calculated as:

$h_i^{k+1} = g(h_i^k, A^k)$

where $g(\cdot)$ refers to the graph convolution function and $+$ indicates the element-wise summation operation. The function $g(\cdot)$ is defined as:

$g(h_i^k, A^k) = \sigma\Big( \sum_{j=1}^{|V|} A^k(i, j) \, (W_A^k * h_j^k + b_A^k) \Big)$

Here, $\sigma$ indicates the ReLU activation function. $W_A^k$ and $b_A^k$ are the weight matrix and bias term for $A^k$.

The initial value of the representation of the $i$-th token, $h_i^0$, is the contextual representation of the token $t_i$ from the output of the LSTM (Section 2.3).

In our system we have trained a stacked GCNN of 3 blocks. The model architecture is shown in Figure 1.

Figure 1: Block diagram of our proposed model
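The following is a minimal PyTorch sketch of one graph-convolution block as defined above; the input/output sizes mirror our reading of Table 2 (a 200-dimensional BiLSTM output and GCNN sizes 500, 200, 200) and are assumptions, not the authors' code.

```python
# One GCNN block: h_next = ReLU(A_k @ (H W^T + b)), where A_k is the k-th-order
# adjacency matrix (|V| x |V|) and H holds the |V| token representations.
import torch
import torch.nn as nn

class GraphConvBlock(nn.Module):
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)  # W_A and b_A for this block

    def forward(self, h, adj):
        # h: (|V|, in_dim) token representations; adj: (|V|, |V|) adjacency matrix
        return torch.relu(adj @ self.linear(h))

# Three stacked blocks, mirroring the sizes reported in Table 2.
blocks = nn.ModuleList([
    GraphConvBlock(200, 500),
    GraphConvBlock(500, 200),
    GraphConvBlock(200, 200),
])
```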

Sub-task   Classes                                          Train     Dev     Test
1a         cse, che, physics, law, math                     23962     4850    2500
1b         bioche, mgmt, com tech, phy, cse                 58500     5842    1921
1d         other, mgmt, phy, com tech, cse, math, bioche    148445    14338   4211
2a         ca, se, algo, cn, pro, ai, dbms                  13580     1360    1929

Table 1: Statistics of datasets for the different subtasks

Training settings
  Loss function           Cross-entropy over the domain classes
  Optimizer               Adam (lr=1e-4, eps=0.1)

System settings
  Token embeddings        140
  PoS embeddings          20
  Relation embeddings     20
  Character embeddings    20
  # of char CNN kernels   3
  # of char CNN filters   30
  LSTM layers             3
  LSTM hidden size        100
  GCNN stack              3
  GCNN sizes              500, 200, 200
  Dropout rate            0.25

Table 2: System settings and hyper-parameters

2.5 Document Representation and Domain Prediction

A representation of the document is obtained as the point-wise mean of the token representations obtained as the output of the GCNN (Section 2.4). Finally, the document representation is passed as input to an MLP layer followed by a softmax layer. The output of the softmax layer is a probability distribution over the possible domain classes. The class label with the highest probability value is predicted as the class of the document.
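A short sketch of this prediction head (toy sizes; the hidden layer width and the number of classes are placeholders):

```python
# Mean-pool the GCNN token outputs, then apply an MLP and softmax over domains.
import torch
import torch.nn as nn

token_reprs = torch.randn(57, 200)            # |V| tokens x GCNN output size
doc_repr = token_reprs.mean(dim=0)            # point-wise mean over tokens

mlp = nn.Sequential(nn.Linear(200, 100), nn.ReLU(), nn.Linear(100, 5))
probs = torch.softmax(mlp(doc_repr), dim=-1)  # distribution over e.g. 5 domains
predicted_class = int(probs.argmax())
```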

3 Data Description

The statistics of the data corresponding to the different subtasks in which we have participated are summarized in Table 1.


Subtask       Accuracy          Precision         Recall            F1-score
              GCN      Best     GCN      Best     GCN      Best     GCN      Best
Subtask-1a    0.6564   0.8156   0.6850   0.8155   0.6564   0.8156   0.6560   0.8144
Subtask-1b    0.6878   0.8335   0.7432   0.8420   0.7023   0.8515   0.6897   0.8353
Subtask-1d    0.4656   0.6117   0.4706   0.6478   0.4523   0.5989   0.4523   0.6044
Subtask-2a    0.7029   0.8252   0.7021   0.8265   0.7034   0.8252   0.7008   0.8244

Table 3: Performance of our system for the different subtasks and comparison with the overall best performing systems. The column heading GCN indicates our system and Best indicates the overall best performing system.

4 Training Setup

In this section we present the settings and hyper-parameters used in our domain prediction system. In Table 2 we summarize the system settings.

5 Results

In this section, we present the performance of our systems corresponding to the different subtasks. In Table 3 we present the performance of our systems in terms of their accuracy, precision, recall and F1-score on the test data and compare them with the best results achieved for each subtask.

6 Error Analysis

We have inspected the outputs of our system on the development set. Thorough error analysis has revealed several types of errors. Aside from the errors propagated from the PoS tagger and dependency parser, some of the errors observed in the output are discussed below:

1. As some of the classes are closely related and many common terms exist across these classes, the system is often confused in detecting the correct class.

• S1: and so we have shown e i square ofx is e i of x for all x .

• S2: so , grad f at r t naught is a vector , whose dot product with any tangent vector is 0.

These instances are from Subtask 1a. S1 is detected as physics (actual class: math), whereas S2 is detected as math (actual class: physics). The same errors happen in the case of Subtask 1b for the classes cse (Computer Science) and com tech (Communication Technology).

2. Most of the candidate texts are single sentences. Sometimes the system has failed to identify the correct class as it could not understand the contextual information from those single sentences. For example:

• S3: Why that is what is the reason why?

The system has failed to classify the domain of these sentences properly. From these errors, it is evident that the problem of domain selection (among highly related classes) is really a challenging task. The accuracy and F1 score of our system are also lower than those of the best submitted system. In future work we will apply other data processing techniques and smarter machine learning models to improve our performance.

References

Charu C. Aggarwal and ChengXiang Zhai. 2012. A survey of text classification algorithms. Pages 163–222.

Timothy Dozat, Peng Qi, and Christopher D. Manning. 2017. Stanford's graph-based neural dependency parser at the CoNLL 2017 shared task. In Proceedings of the CoNLL 2017 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 20–30, Vancouver, Canada. Association for Computational Linguistics.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep Learning. MIT Press. http://www.deeplearningbook.org.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Comput., 9(8):1735–1780.

Matthew Honnibal and Mark Johnson. 2015. An improved non-monotonic transition system for dependency parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1373–1378, Lisbon, Portugal. Association for Computational Linguistics.

Matthew Honnibal and Ines Montani. 2017. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear.

Thomas N. Kipf and Max Welling. 2016. Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Kamran Kowsari, Kiana Jafari Meimandi, Mojtaba Heidarysafa, Sanjana Mendu, Laura E. Barnes, and Donald E. Brown. 2019. Text classification algorithms: A survey.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher Manning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the 10th International Conference on Language Resources and Evaluation (LREC 2016), pages 1659–1666, Portorož, Slovenia. European Language Resources Association.

Peng Qi, Timothy Dozat, Yuhao Zhang, and Christopher D. Manning. 2018. Universal dependency parsing from scratch. In Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pages 160–170, Brussels, Belgium. Association for Computational Linguistics.

M. Schuster and K. K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11):2673–2681.


Proceedings of the 17th International Conference on Natural Language Processing: TechDOfication 2020 Shared Task, pages 11–15, Patna, India, December 18 - 21, 2020. ©2020 NLP Association of India (NLPAI)

Multichannel LSTM-CNN for Telugu Technical Domain Identification

Sunil Gundapu
Language Technologies Research Centre
KCIS, IIIT Hyderabad
Telangana, India
[email protected]

Radhika Mamidi
Language Technologies Research Centre
KCIS, IIIT Hyderabad
Telangana, India
[email protected]

Abstract

With the instantaneous growth of text information, retrieving domain-oriented information from text data has a broad range of applications in Information Retrieval and Natural Language Processing. Thematic keywords give a compressed representation of the text. Usually, Domain Identification plays a significant role in Machine Translation, Text Summarization, Question Answering, Information Extraction, and Sentiment Analysis. In this paper, we propose the Multichannel LSTM-CNN methodology for Technical Domain Identification for Telugu. This architecture was used and evaluated in the context of the ICON shared task "TechDOfication 2020" (task h), and our system obtained an F1 score of 69.9% on the test dataset and 90.01% on the validation set.

1 Introduction

Technical Domain Identification is the task of automatically identifying and categorizing a set of unlabeled text passages/documents into their corresponding domain categories from a predefined domain category set. The domain category set consists of 6 category labels: Bio-Chemistry, Communication Technology, Computer Science, Management, Physics, and Other. These domains can be viewed as a set of text passages, and test text data can be treated as a query to the system. Domain identification has many applications like Machine Translation, Summarization, Question Answering, etc. This task would be the first step for most downstream applications (e.g., Machine Translation): it decides the domain of the text data, and afterwards Machine Translation can choose its resources as per the identified domain.

The majority of the research work in the area of text classification and domain identification has been done in English; there has been far less contribution for regional languages, especially Indian languages. Telugu is one of India's old traditional languages, and it belongs to the Dravidian language family. According to the Ethnologue1 list, there are about 93 million native Telugu speakers, and it ranks as the sixteenth most-spoken language worldwide.

We tried to identify the domain of Telugu text data using various supervised Machine Learning and Deep Learning techniques in our work. Our Multichannel LSTM-CNN method outperforms the other methods on the provided dataset. This approach incorporates the advantages of CNN and Self-Attention based BiLSTM into one model.

The rest of the paper is structured as follows: Section 2 explains some related work on domain identification, Section 3 describes the dataset provided in the shared task, Section 4 addresses the methodology applied in the task, Section 5 presents the results and error analysis, and finally, Section 6 concludes the paper as well as possible future work.

2 Related Work

Several methods for domain identification and text categorization have been applied to Indian languages, and a few works have been reported on the Telugu language. In this section, we survey some of the methodologies and approaches used to address domain identification and text categorization.

Murthy (2005) explains automatic text categorization with special emphasis on Telugu. In his research work, supervised classification using the Naive Bayes classifier has been applied to 800 Telugu news articles for text categorization. Swamy et al. (2014) work on representing and categorizing Indian language text documents using the text mining techniques K Nearest Neighbour, Naive Bayes, and decision tree classifiers.

1 https://www.ethnologue.com/guides/ethnologue200


Categorization of Telugu text documents using language-dependent and independent models was proposed by Narala et al. (2017). Durga and Govardhan (2011) introduced a model for document classification and text categorization; their paper described a term frequency ontology-based text categorization for Telugu documents. Combining the robustness of LSTM and CNN, Liu et al. (2020) proposed an Attention-based Multichannel Convolutional Neural Network for text classification. In their network, BiLSTM encodes the history and future information of words, and CNN captures relations between words.

3 Dataset Description

We used the dataset provided by the organizers of Task-h of TechDOfication 2020 for training the models. The data for the task consists of 68865 text documents for training, 5920 for validation, and 2611 for testing. For hyperparameter tuning, we used the validation set provided by the organizers. The statistics of the dataset are shown in Table 1, and the number of texts per class can be seen in Figure 1.

Labels      Train Data   Validation Data
cse         24937        2175
phy         16839        1650
com tech    11626        970
bio tech    7468         580
mgnt        2347         155
other       5648         390
Total       68865        5920

Table 1: Dataset Statistics

Figure 1: Number of samples per class

4 Proposed Approach

4.1 Data Preprocessing

The text passages have been originally provided in the Telugu script with the corresponding domain tags. The text documents have some noise, so before passing the text to the training stage, they are preprocessed using the following procedure:

• Acronym Mapping Dictionary: We created an acronym mapping dictionary and expanded the English acronyms using it.

• Find Language Words: Sometimes, English words are co-located with Telugu words in the passage. We find the indices of those words to translate them into Telugu.

• Translate English Words: Translate the English words identified in the previous stage of preprocessing into the Telugu language. Google's Translation API2 was used for this purpose.

• Hindi Sentence Translation: We can observe a few Hindi sentences in the dataset. We translated those sentences into Telugu using the Google translation tool.

• Noise Removal: Removed unnecessary tokens, punctuation marks, non-UTF format tokens, and single-length English tokens from the text data (a small sketch of this step follows this list).
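The sketch below illustrates two of these steps (acronym expansion and noise removal); the acronym dictionary and regular expressions are illustrative placeholders, not the authors' actual resources.

```python
# Illustrative preprocessing: expand known acronyms, strip punctuation,
# and drop single-character English tokens.
import re

ACRONYMS = {"AI": "Artificial Intelligence", "DBMS": "Database Management System"}

def preprocess(text):
    tokens = [ACRONYMS.get(tok, tok) for tok in text.split()]   # acronym expansion
    text = re.sub(r"[^\w\s]", " ", " ".join(tokens))            # remove punctuation
    kept = [t for t in text.split()
            if not (len(t) == 1 and t.isascii() and t.isalpha())]
    return " ".join(kept)
```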

4.2 Supervised Machine Learning Algorithms

To build the finest system for domain identification, we started with supervised machine learning techniques and then moved to deep learning models. SVM, Multilayer Perceptron, Linear Classifier, and Gradient Boosting methods performed very well on the given training dataset. These supervised models were trained on word-level, n-gram-level, and character-level TF-IDF vector representations.
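As a hedged illustration of these baselines, the pipeline below combines word- and character-level TF-IDF features with a linear SVM using scikit-learn; the n-gram ranges and the choice of LinearSVC are our own assumptions.

```python
# Word- and character-level TF-IDF features fed to a linear SVM; MLP or
# gradient boosting classifiers could be swapped in the same way.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

tfidf = FeatureUnion([
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 5))),
])
clf = Pipeline([("features", tfidf), ("svm", LinearSVC())])
# clf.fit(train_texts, train_labels); predictions = clf.predict(test_texts)
```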

4.3 Multichannel LSTM-CNN Architecture

We started experiments with individual LSTM, GRU, and CNN models with different word embeddings like word2vec, GloVe and fastText. However, an ensemble of the CNN with a self-attention LSTM model gave better results than the individual models.

2 https://py-googletrans.readthedocs.io/en/latest/


We develop a multichannel model for domain identification consisting of two main components. The first component is a long short-term memory network (Hochreiter and Schmidhuber, 1997) (henceforth, LSTM). The advantage of the LSTM is that it can handle long-term dependencies, but it does not store the global semantics of foregoing information in a variable-sized vector. The second component is a convolutional neural network (LeCun, 1989) (henceforth, CNN). The advantage of the CNN is that it can capture the n-gram features of text by using convolution filters, but its performance is restricted by the convolutional filter size. By considering the strengths of these two components, we ensemble the LSTM and CNN models for domain identification.

Figure 2: Multichannel LSTM-CNN Model

4.3.1 Self-Attention BiLSTM Classifier

The first module in the architecture is the Self-Attention based BiLSTM classifier. We employed this self-attention (Xu et al., 2015) based BiLSTM model to extract the semantic and sentiment information from the input text data. Self-attention is an intra-attention mechanism in which a softmax function gives each subword's weight in the sentence. The outcome of this module is a weighted sum of the hidden representations at each subword.

The self-attention mechanism is built on the BiLSTM architecture (see Figure 3), and it takes as input pre-trained embeddings of the subwords. We passed the Telugu fastText (Grave et al., 2018) subword embeddings to a BiLSTM layer to get a hidden representation at each timestep, which is the input to the self-attention component.

Figure 3: Self-Attention BiLSTM

Suppose the input sentence $S$ is given by the subwords $(w_1, w_2, ..., w_n)$. Let $\overrightarrow{h}$ represent the forward hidden state and $\overleftarrow{h}$ the backward hidden state at the $i$-th position in the BiLSTM. The merged representation $k_i$ is obtained by combining the forward and backward hidden states: we concatenate the forward and backward hidden units to get the merged representations $(k_1, k_2, ..., k_n)$.

$k_i = [\overrightarrow{h_i}; \overleftarrow{h_i}]$   (1)

The self-attention model gives a score $e_i$ to each subword $i$ in the sentence $S$, as given by the equation below:

$e_i = k_i^T k_n$   (2)

Then we calculate the attention weight $a_i$ by normalizing the attention score $e_i$:

$a_i = \frac{\exp(e_i)}{\sum_{j=1}^{n} \exp(e_j)}$   (3)

Finally, we compute the latent representation vector $h$ of the sentence $S$ using the equation below:

$h = \sum_{i=1}^{n} a_i \times k_i$   (4)

The latent representation vector $h$ is fed to a fully connected layer followed by a softmax layer to obtain the probabilities $P_{lstm}$.
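Equations (1)-(4) can be written compactly in PyTorch; the sketch below uses toy sizes and is only an illustration of the attention step, not the authors' implementation.

```python
# Self-attention over merged BiLSTM states: scores are dot products with the
# last state, softmax-normalised, and the sentence vector is their weighted sum.
import torch

n, hidden = 6, 64
k = torch.randn(n, 2 * hidden)        # merged states k_1..k_n (eq. 1)

e = k @ k[-1]                         # e_i = k_i^T k_n            (eq. 2)
a = torch.softmax(e, dim=0)           # attention weights a_i      (eq. 3)
h = (a.unsqueeze(1) * k).sum(dim=0)   # h = sum_i a_i * k_i        (eq. 4)
```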

4.3.2 Convolutional Neural Network

The second component is the CNN, which considers the ordering of the words and the context in which each word appears in the sentence. We present the Telugu fastText subword embeddings (Bojanowski et al., 2016) of a sentence to a 1D-CNN (see Figure 4) to generate the required embeddings.


Figure 4: CNN Classifier

Initially, we present a $d \times S$ sentence embedding matrix to the convolution layer, where each row is a d-dimensional fastText subword embedding vector of a word and S is the sentence length. We perform convolution operations in the convolution layer with three different kernel sizes (2, 4, and 6). The purpose behind using various kernel sizes was to capture contexts of varying lengths and to extract local features around each word window. The output of the convolution layers was passed to corresponding max-pooling layers. The max-pooling layer is used to preserve the word order and bring out the important features from the feature map. We replace the original max-pooling layer in the convolutional neural network with a word-order-preserving k-max-pooling layer to preserve the word order of the input sentences. The order-preserving max-pooling layer reduces the number of features while preserving the order of these words.

The CNN and BiLSTM models softmax prob-abilities are aggregated (See figure 2) using anelement-wise product to obtain the final probabil-ities Pfinal = Pcnn ◦ Plstm. We tried various ag-gregation techniques like average, maximum, min-imum, element-wise addition, and element-wisemultiplication to combine LSTM and CNN mod-els’ probabilities. But an element-wise productgave better results than other techniques.

5 Results and Error Analysis

We first started our experiments with machine learning algorithms using various kinds of TF-IDF feature vector representations. SVM and MLP gave pretty good results on the validation dataset but failed on the bio-chemistry and management data points, and the CNN model with fastText word embeddings seemed to be confused between the computer technology and cse data points (see Table 2). Maybe the reason behind this confusion is that both kinds of data points are quite similar at the syntactic level.

The self-attention based BiLSTM model outperforms the CNN model on the physics, cse, and computer technology data points, though it performs worse on the bio-chemistry and management data points. If we observe the training set, 75% of the data samples belong to the physics, cse, and computer technology domains, and the remaining 25% of the data belongs to the remaining domain labels. So we assume that the imbalanced training set was also one of the reasons for the misclassification of bio-chemistry and management domain samples. We tried to handle this data-skewing problem with the SMOTE (Chawla et al., 2002) technique, but it did not work very well.

After observing the CNN and BiLSTM model results, we ensembled these two models for better results. The ensemble Multichannel LSTM-CNN model outperforms all our previous models, achieving a recall of 0.90 with a weighted F1-score of 0.90 on the development dataset and a recall of 0.698 with an F1-score of 0.699 on the test dataset (see Table 3).


                    Validation Data                            Test Data
Model               Accuracy  Precision  Recall   F1-Score     Accuracy  Precision  Recall   F1-Score
SVM                 0.8612    0.8598     0.8601   0.866        -         -          -        -
CNN                 0.8890    0.8874     0.8897   0.8902       0.6465    0.6558     0.6948   0.6609
BiLSTM              0.8706    0.8714     0.8702   0.8786       0.6465    0.6519     0.6927   0.6571
Multichannel        0.9001    0.9008     0.9006   0.9003       0.6928    0.6984     0.7228   0.6992
Organizers System   -         -          -        -            0.6855    0.6974     0.7289   0.7028

Table 2: Comparison between various model results

Measures     Validation Data   Test Data
precision    0.90              0.7228
recall       0.90              0.6984
f1-score     0.90              0.6992
accuracy     0.9003            0.6928

Table 3: Performance of the multichannel system

6 Conclusion

In this paper, we proposed a multichannel approach that integrates the advantages of CNN and LSTM. This model captures local and global dependencies and sentiment in a sentence. Our approach gives better results than individual CNN, LSTM, and supervised machine learning algorithms on the Telugu TechDOfication dataset.

As discussed in the previous section, we will handle the data imbalance problem more efficiently in future work, and we will improve the performance on the ambiguous cases in the cse and computer technology domains.

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with subword information. arXiv preprint arXiv:1607.04606.

Nitesh Chawla, Kevin Bowyer, Lawrence Hall, and W. Kegelmeyer. 2002. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. (JAIR), 16:321–357.

A. Durga and A. Govardhan. 2011. Ontology based text categorization - Telugu documents.

Edouard Grave, Piotr Bojanowski, Prakhar Gupta, Armand Joulin, and Tomas Mikolov. 2018. Learning word vectors for 157 languages. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018).

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation, 9:1735–80.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron C. Courville, Ruslan Salakhutdinov, Richard S. Zemel, and Yoshua Bengio. 2015. Show, attend and tell: Neural image caption generation with visual attention. CoRR.

Y. LeCun. 1989. Generalization and network design strategies.

Zhenyu Liu, Haiwei Huang, Chaohong Lu, and Shengfei Lyu. 2020. Multichannel CNN with attention for text classification. ArXiv, abs/2006.16174.

Kavi Murthy. 2005. Automatic categorization of Telugu news articles.

Swapna Narala, B. P. Rani, and K. Ramakrishna. 2017. Telugu text categorization using language models. Global Journal of Computer Science and Technology.

M. N. Swamy, M. Hanumanthappa, and N. Jyothi. 2014. Indian language text representation and categorization using supervised learning algorithm. 2014 International Conference on Intelligent Computing Applications, pages 406–410.


Proceedings of the 17th International Conference on Natural Language Processing: TechDOfication 2020 Shared Task, pages 16–20, Patna, India, December 18 - 21, 2020. ©2020 NLP Association of India (NLPAI)

Multilingual Pre-Trained Transformers and Convolutional NN Classification Models for Technical Domain Identification

Suman Dowlagar
LTRC
IIIT-Hyderabad
suman.dowlagar@research.iiit.ac.in

Radhika Mamidi
LTRC
IIIT-Hyderabad
radhika.mamidi@iiit.ac.in

Abstract

In this paper, we present a transfer learning system to perform technical domain identification on multilingual text data. We have submitted two runs: one uses the transformer model BERT, and the other uses XLM-ROBERTa with a CNN model for text classification. These models allowed us to identify the domain of the given sentences for the ICON 2020 shared task, TechDOfication: Technical Domain Identification. Our system ranked the best for subtasks 1d and 1g on the given TechDOfication dataset.

1 Introduction

Automated technical domain identification is a categorization/classification task where the given text is categorized into a set of predefined domains. It is employed in tasks like Machine Translation, Information Retrieval, Question Answering, Summarization, and so on.

In Machine Translation, Summarization, Question Answering, and Information Retrieval, the domain classification model will help leverage the contents of technical documents, select the appropriate domain-dependent resources, and provide personalized processing of the given text.

Technical domain identification comes under text classification or categorization. Text classification is one of the fundamental tasks in the field of NLP. It is the process of identifying the category to which the given text belongs. Automated text classification helps to organize unstructured data, which can help us gather insightful information to make future decisions on downstream tasks.

Traditional text classification approaches mainly focus on feature engineering techniques such as bag-of-words and classification algorithms (Yang, 1999). Nowadays, the state-of-the-art results on text classification are achieved by various NNs such as CNN (Kim, 2014), LSTM (Hochreiter and Schmidhuber, 1997), BERT (Adhikari et al., 2019), and Text GCN (Yao et al., 2019). Attention mechanisms (Vaswani et al., 2017) have been introduced in these models, which increased the representativeness of the text for better classification. Transformer models such as BERT (Devlin et al., 2018) use the attention mechanism, which learns contextual relations between words or sub-words in a text. Text GCN (Yao et al., 2019) uses a graph-convolutional network to learn a heterogeneous word-document graph over the whole corpus, which helps classify the text. However, of all the deep learning approaches, transformer models provided SOTA results in text classification.

In this paper, We present two approaches fortechnical domain identification. One approach usesthe pre-trained Multilingual BERT model, and theother uses XLM-ROBERTa with CNN model.

The rest of the paper is structured as follows.Section 2 describes our approach in detail. In Sec-tion 3, we provide the analysis and evaluation ofresults for our system, and Section 4 concludes ourwork.

2 Our Approach

Here we present two approaches for the TechDOfication task.

2.1 BERT for TechDOfication

In the first approach, we use the pre-trained multilingual BERT model for domain identification of the given text. Bidirectional Encoder Representations from Transformers (BERT) is a transformer encoder stack trained on large corpora. Like the vanilla transformer model (Vaswani et al., 2017), BERT takes a sequence of words as input. Each layer applies self-attention, passes its results through a feed-forward network, and then hands it off to the next encoder. The BERT configuration used here takes a sequence of words/tokens at a maximum length of 512 and produces an encoded representation of dimensionality 768.

Figure 1: The architecture of the BERT model for sentence classification.

The pre-trained multilingual BERT models have a better word representation as they are trained on a large multilingual Wikipedia and book corpus. As the pre-trained BERT model is trained on generic corpora, we need to finetune the model for the given domain identification tasks. During finetuning, the pre-trained BERT model parameters are updated.

In this architecture, only the [CLS] (classification) token output provided by BERT is used. The [CLS] output is the output of the 12th transformer encoder with a dimensionality of 768. It is given as input to a fully connected neural network, and the softmax activation function is applied to classify the given sentence.
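As a rough PyTorch sketch of this classification head (not the authors' released code), the following wraps the pre-trained multilingual BERT encoder, takes the [CLS] vector of the last encoder layer, and applies a fully connected layer followed by softmax; the class name and the 0.1 dropout value are our own assumptions.

import torch
import torch.nn as nn
from transformers import BertModel

class BertClsClassifier(nn.Module):
    """Multilingual BERT encoder with a fully connected layer over the [CLS] output."""
    def __init__(self, num_labels, model_name="bert-base-multilingual-cased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        self.dropout = nn.Dropout(0.1)                       # assumed dropout value
        self.classifier = nn.Linear(self.bert.config.hidden_size, num_labels)  # 768 -> num_labels

    def forward(self, input_ids, attention_mask):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        cls_vector = outputs.last_hidden_state[:, 0, :]      # [CLS] token representation (dim 768)
        logits = self.classifier(self.dropout(cls_vector))
        return torch.softmax(logits, dim=-1)                 # class probabilities for the sentence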

2.2 XLM-ROBERTa with CNN for TechDOfication

Figure 2: The architecture of the XLM-ROBERTa with CNN for sentence classification.

XLM-ROBERTa (Conneau et al., 2019) is a transformer-based multilingual masked language model pre-trained on text in 100 languages, which obtains state-of-the-art performance on cross-lingual classification, sequence labeling, and question answering. XLM-ROBERTa improves upon BERT by training on a larger dataset, dynamically masking out tokens instead of using the original static masking, and using a known preprocessing technique (Byte-Pair Encoding) and a dual-language training mechanism with BERT in order to learn better relations between words in different languages. The model is trained for the language modeling task, and its output has a dimensionality of 768. It is given as input to a CNN (Kim, 2014) because convolution layers can extract better data representations than feed-forward layers, which indirectly helps in better domain identification.
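To make the combination concrete, the sketch below applies Kim (2014)-style 1D convolutions over the 768-dimensional token outputs of xlm-roberta-base and max-pools them before a linear classification layer; the filter widths (3, 4, 5) and 100 filters per width are illustrative assumptions, not the authors' exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import XLMRobertaModel

class XLMRCnnClassifier(nn.Module):
    """XLM-RoBERTa token representations fed to a Kim-style CNN text classifier."""
    def __init__(self, num_labels, filter_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        self.encoder = XLMRobertaModel.from_pretrained("xlm-roberta-base")
        hidden = self.encoder.config.hidden_size            # 768
        self.convs = nn.ModuleList([nn.Conv1d(hidden, num_filters, k) for k in filter_sizes])
        self.classifier = nn.Linear(num_filters * len(filter_sizes), num_labels)

    def forward(self, input_ids, attention_mask):
        tokens = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        x = tokens.transpose(1, 2)                           # (batch, hidden, seq_len) for Conv1d
        pooled = []
        for conv in self.convs:
            h = F.relu(conv(x))                              # (batch, num_filters, L')
            pooled.append(F.max_pool1d(h, h.size(2)).squeeze(2))
        return self.classifier(torch.cat(pooled, dim=1))     # logits over the domains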

3 Experiment

This section presents the datasets used, the task description, and the two models' performance on technical domain identification. We also include our implementation details and error analysis in the subsequent sections.

3.1 Dataset

We used the dataset provided by the organizers of TechDOfication ICON-2020. There are two subtasks: one is coarse-grained, and the other is fine-grained. The coarse-grained TechDOfication dataset contains sentences from the Chemistry, Communication Technology, Computer Science, Law, Math, and Physics domains in different languages such as English, Bengali, Gujarati, Hindi, Malayalam, Marathi, Tamil, and Telugu. The fine-grained English dataset focuses on the Computer Science domain with the sub-domain labels Artificial Intelligence, Algorithm, Computer Architecture, Computer Networks, Database Management System, Programming, and Software Engineering.

3.2 Implementation

For the implementation, we used the transformers library provided by HuggingFace¹. HuggingFace contains the pre-trained multilingual BERT, XLM-ROBERTa, and other models suitable for downstream tasks. The pre-trained multilingual BERT model used is "bert-base-multilingual-cased" and the pre-trained XLM-R model used is "xlm-roberta-base". We programmed the CNN architecture as given in the paper (Kim, 2014). We used the PyTorch library, which supports GPU processing, for implementing the deep neural nets. The BERT models were run on Google Colab and Kaggle GPU notebooks. We trained our classifier with a batch size of 128 for 10 to 30 epochs based on our experiments. The dropout is set to 0.1, and the Adam optimizer is used with a learning rate of 2e-5. We used the HuggingFace pre-trained BERT tokenizer for tokenization. For the multilingual-BERT based approach, we used the BertForSequenceClassification module provided by the HuggingFace library for finetuning and sequence classification.

1 https://huggingface.co/
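A minimal fine-tuning loop consistent with the settings reported above (bert-base-multilingual-cased, BertForSequenceClassification, batch size 128, Adam with a 2e-5 learning rate) might look as follows; the two training sentences, the label count, the 128-token padding length, and the epoch count are placeholders rather than values from the shared task.

import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import BertTokenizerFast, BertForSequenceClassification

train_texts = ["the scheduler allocates cpu time to processes",      # placeholder sentences
               "the reaction rate doubles with temperature"]
train_labels = [0, 1]                                                 # placeholder label ids

tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-multilingual-cased", num_labels=6)                     # e.g., six coarse-grained domains

enc = tokenizer(train_texts, truncation=True, padding=True, max_length=128, return_tensors="pt")
loader = DataLoader(TensorDataset(enc["input_ids"], enc["attention_mask"],
                                  torch.tensor(train_labels)), batch_size=128, shuffle=True)

optimizer = torch.optim.Adam(model.parameters(), lr=2e-5)
model.train()
for epoch in range(10):                                               # the paper reports 10 to 30 epochs
    for input_ids, attention_mask, labels in loader:
        optimizer.zero_grad()
        out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
        out.loss.backward()                                           # cross-entropy from the classification head
        optimizer.step()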

3.3 Baseline models

Here, we compared the BERT model with other machine learning algorithms.

SVM with TF-IDF text representation: We chose Support Vector Machines (SVM) with a TF-IDF text representation for technical domain identification. The SVM classifier and TF-IDF vector representation are obtained from the scikit-learn library (Pedregosa et al., 2011).
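Such a baseline can be assembled in a few lines of scikit-learn; the pipeline below is an illustrative sketch rather than the authors' exact settings, and the in-line sentences and labels are placeholders for the shared-task data.

from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

train_texts = ["the kernel schedules processes", "the contract is legally binding"]  # placeholders
train_labels = ["cse", "law"]
dev_texts = ["packets are routed across the network"]
dev_labels = ["cse"]

# TF-IDF features followed by a linear-kernel SVM, as in the baseline description.
svm_baseline = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("svm", LinearSVC()),
])
svm_baseline.fit(train_texts, train_labels)
print("macro F1:", f1_score(dev_labels, svm_baseline.predict(dev_texts), average="macro"))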

CNN: Convolutional Neural Network (Kim, 2014). We explored CNN-non-static, which uses pre-trained word embeddings.

3.4 Results

The results are tabulated in Table 1. We evaluated the performance of the methods using macro F1. The multilingual-BERT model performed well compared to the SVM with TF-IDF and CNN models. Across all the languages, we observed an increase of 7 to 25% in classification metrics for BERT compared to the baseline SVM classifier, and a 2 to 5% increase compared to the CNN classifier on the validation data. On the test data, multilingual BERT showed better performance on subtasks 1a, 1b, 1c, 1h and 2a, whereas XLM-ROBERTa with CNN showed better performance on subtasks 1d, 1e, 1f, 1g. This increase in classification metrics is due to the transformer model's and convolutional NN's capability to learn better text representations from the generic data than the other models.

4 Error Analysis

The multilingual-BERT model's confusion matrices are compared with those of the poorest-performing model for the Hindi and Tamil languages in Figure 3. We chose Hindi and Tamil because the difference in performance is more significant there. For the Hindi subtask, the SVM classifier confused the "cse", "com tech", and "mgmt" labels, whereas the BERT model performed better. For the Tamil subtask, the SVM classifier confused the "com tech" and "mgmt" labels, whereas the BERT model performed better than the other models. This is because both approaches (pre-trained multilingual-BERT and pre-trained XLM-ROBERTa with CNN) learned a better representation of the above data than the other models, which helped in technical domain identification.

5 Conclusion and Future work

We used pre-trained bi-directional encoder representations, multilingual-BERT and XLM-ROBERTa with CNN, for technical domain identification in English, Bengali, Gujarati, Hindi, Malayalam, Marathi, Tamil, and Telugu. We compared the approaches with the baseline methods. Our analysis showed that finetuning pre-trained multilingual BERT and XLM-ROBERTa with CNN models for the text classification task gives an increase in macro F1 score and accuracy compared to the baseline approaches.

Some datasets, like those for Hindi, Tamil, and Telugu, are large; for these we could train the BERT and XLM-ROBERTa models from scratch, take their hidden layer representations, and concatenate these with the representation of the pre-trained model. This might help classify the datasets even better.

References

Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019. DocBERT: BERT for document classification. arXiv preprint arXiv:1904.08398.

Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzman, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2019. Unsupervised cross-lingual representation learning at scale. arXiv preprint arXiv:1911.02116.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.


Dataset               Validation                          Test
                      SVM    CNN    M-BERT  XLM-R+CNN    M-BERT  XLM-R+CNN
English subtask-1a    81.48  83.05  88.87   87.09        79.84   73.57
Bengali subtask-1b    66.35  85.78  86.81   85.71        80.35   78.17
Gujarati subtask-1c   69.63  86.27  87.21   86.89        68.67   66.73
Hindi subtask-1d      58.21  81.03  83.40   82.13        59.89   60.44
Malayalam subtask-1e  80.60  92.51  94.72   93.40        34.47   34.86
Marathi subtask-1f    73.32  86.89  87.42   86.37        59.52   59.89
Tamil subtask-1g      65.95  85.75  87.50   86.54        49.24   51.34
Telugu subtask-1h     71.98  88.07  90.28   89.43        67.17   62.26
English subtask-2a    70.24  72.53  77.36   76.77        78.98   78.07

Table 1: Macro F1 on validation and test data for all the subtasks

Figure 3: Confusion matrices on the given validation data for the Hindi and Tamil languages (panels (a)-(d))

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008.

Yiming Yang. 1999. An evaluation of statistical approaches to text categorization. Information Retrieval, 1(1-2):69–90.

Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7370–7377.


Technical Domain Identification using word2vec and BiLSTM

Koyel Ghosh
[email protected]

Apurbalal Senapati
[email protected]

Ranjan Maity
[email protected]

CSE Department, Central Institute of Technology, Kokrajhar (Assam)

Abstract

Coarse-grained and fine-grained classification tasks are mostly based on sentiment or basic emotion analysis. Switching from emotion and sentiment analysis to another domain, in this paper we work on technical domain identification. The task is to identify the technical domain of a given English text. In coarse-grained domain classification, such a piece of text provides information about specific coarse-grained technical domains like Computer Science, Physics, Math, etc., and in fine-grained domain classification, fine-grained sub-domains of the Computer Science domain, such as Artificial Intelligence, Algorithm, Computer Architecture, Computer Networks, Database Management System, etc. For this task, a Word2Vec skip-gram model is used for word embedding; later, a Bidirectional Long Short Term Memory (BiLSTM) model is applied to classify coarse-grained domains and fine-grained sub-domains. To evaluate the performance of the proposed model, accuracy, precision, recall, and F1-score have been applied.

1 Introduction

ICON 2020¹ has organized a shared task (details here: https://ssmt.iiit.ac.in/techdofication.html) where they share some datasets for the Shared Task on Identification of a Technical Domain from Text. Among them, we work with the Subtask-1a Coarse-grained Domain Classification - English and Subtask-2a Fine-grained Domain Classification - Computer Science datasets. In this paper, the system description of our approach to identifying a technical domain from text and the results of this approach are discussed. A lot of work has already been done successfully (Akhtar et al., 2020) on coarse-grained and fine-grained classification with sentiment (Cortis et al., 2017) and emotion analysis (Mohammad and Bravo-Marquez, 2017) datasets. Often, in the classification task, Word2Vec, fastText, GloVe, or an all-combined approach (Salur and Aydin, 2020) is used to exploit the effectiveness of different word embedding algorithms. To get a better result on a domain-specific corpus, Occupational Safety and Health Administration (OSHA), a hybrid deep neural network with Word2Vec was used (Zhang, 2019). ESIM with SuBiLSTM (Ensemble) and ESIM with SuBiLSTM-Tied (Ensemble) approaches (Brahma, 2018) performed well on the Stanford Sentiment Treebank dataset (Socher et al., 2013), both in its binary (SST-2) and fine-grained (SST-5) forms. A similar approach was applied to question classification, i.e., the TREC dataset (Voorhees, 2006), both in its 6-class (TREC-6) and 50-class (TREC-50) forms. A bidirectional dilated LSTM with attention (Schoene et al., 2020) was used for another fine-grained dataset (Klinger et al., 2018). (Melamud et al., 2016) proposed a BiLSTM neural network architecture based on Word2Vec's CBOW architecture. Some earlier approaches address domain classification (Bernier-Colborne et al., 2017). (Zhang, 2019) addresses accident cause classification with a deep learning and Word2Vec approach; they compare their model with others where bi-gram and n-gram text representations were used. In (Xie et al., 2019), the authors added an attention layer with BiLSTM for short-text fine-grained sentiment classification to get better accuracy.

1 https://www.iitp.ac.in/ai-nlp-ml/icon2020/sharedtasks.html

2 Methodology

In this section, the dataset, data preprocessing, word embedding, and the structure of the BiLSTM with Word2Vec model will be discussed. The architecture of the proposed BiLSTM with Word2Vec approach is shown in Figure 1.


Figure 1: Architecture of BiLSTM + Word2Vec with ReLU and Softmax on top

2.1 Dataset

Here, the Technical Domain Identification dataset² is used in its Coarse-grained Domain Classification - English (Subtask-1a) and Fine-grained Domain Classification - Computer Science (Subtask-2a) forms. Table 1 shows the details of the dataset for the shared task. In the case of coarse-grained domain classification, a piece of text provides information about a specific coarse-grained domain, like the Computer Science domain. The domains or classes are: Computer Science (cse), Chemistry (che), Physics (phy), Law (law), and Math (math). In the case of fine-grained domain classification, the sub-domains or classes from Computer Science are: Computer Architecture (ca), Software Engineering (se), Algorithm (algo), Computer Networks (cn), Programming (pro), Artificial Intelligence (ai), and Database Management System (dbms). Table 2 shows the distribution of the domains and sub-domains in the training and dev datasets. For prediction purposes, the test set has been provided without domain or sub-domain labels. So, later in this paper, the dev set is used to evaluate accuracy.

2.2 Preprocessing of the Data

Based on the analysis from previous studies, deep learning needs text data in numeric form. To encode text data into a numeric vector, many encoding techniques like bag-of-words, bi-gram, n-gram, TF-IDF, Word2Vec, etc. are used. So, before encoding, the text data needs to be cleaned and noise-free to increase the classification performance. Figure 2 shows all the intermediate steps of preprocessing.

2 https://ssmt.iiit.ac.in/techdofication.html

Figure 2: Text preprocessing steps used in this study

Cleaning or preprocessing of the data is as important as model building. The text preprocessing procedure can differ depending on the task and dataset we use. In our case, we used the following steps:

Convert the text into lowercase: All words should be either in lower or upper case to avoid redundancy. Suppose there are two words "artificial" and "Artificial"; the machine will treat them as separate words if we skip this step.

Removing punctuation and numbers: Punctuation and numbers often do not add extra meaning to the text. This text has several punctuation marks ( ) , ; . etc. and numbers (0-9). The string library, which has 32 punctuation characters, is used to remove all these punctuation marks from the text to get a better result.

Removing stopwords: The most common words in a language, like "the", "a", "have", "is", "to", etc., are called stopwords. As these words do not add any important meaning, they can be removed.

Remove frequent words: Some words which are frequently used in the text but not listed in the stopwords have been removed.

Word tokenization: To break long sentences into words, we applied tokenization.


Dataset     Training  Dev    Test   Labels  Length (Max)  Vocabulary size
Subtask-1a  23,959    4,850  2,500  5       74            10,722
Subtask-2a  13,580    1,360  1,930  7       266           7,397

Table 1: Data set

Dataset     Domain   Train  Dev
Subtask-1a  cse      4,770  970
            che      4,733  970
            physics  4,787  970
            law      4,829  970
            math     4,840  970
Subtask-2a  ca       1,947  180
            se       1,940  200
            algo     1,951  200
            cn       1,940  180
            pro      1,922  200
            ai       1,940  200
            dbms     1,940  200

Table 2: Dataset statistics

Lemmatization: Stemming and lemmatization both have almost the same goal, i.e., to reduce the inflectional forms of each word and convert them to a common root form, but they differ in the result we get. Stemming simply chops off the inflections of each word, and sometimes the resultant word may not carry any valid meaning; lemmatization does it properly by using the language's full vocabulary to apply a morphological analysis to the words and return the base or dictionary form of a word, known as the lemma, so that the words can be analyzed as a single item.

Here, lemmatization has been applied to the text to get the desired result.

Remove short strings: After performing all the required text processing steps, some words that are very short in length still remain in the text. So, we remove the words having a length less than or equal to 2.

Label encoding: As the domains and sub-domains labelled on the texts are words, we need to encode them into unique numbers: cse - 0, che - 1, physics - 2, law - 3, math - 4 for Subtask 1a, and ca - 0, se - 1, algo - 2, cn - 3, pro - 4, ai - 5, dbms - 6 for Subtask 2a.

Here, we use the Natural Language Toolkit (NLTK) (Wagner, 2010) for tokenization, lemmatization and removing stopwords. After these steps, we get the maximum length of a sentence, i.e., the maximum number of words present in a sentence, as mentioned in Table 1.
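The steps above could be approximated with NLTK roughly as in the sketch below; it is only an illustration, and the frequent-word set is left as a parameter because the paper does not state how many frequent words were removed.

# Requires: nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words("english"))
label2id = {"cse": 0, "che": 1, "physics": 2, "law": 3, "math": 4}    # Subtask-1a label encoding

def preprocess(text, frequent_words=frozenset()):
    """Lowercase, drop punctuation/digits, remove stopwords, frequent and short words, lemmatize."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation + "0123456789"))
    tokens = word_tokenize(text)
    return [lemmatizer.lemmatize(t) for t in tokens
            if t not in stop_words and t not in frequent_words and len(t) > 2]

print(preprocess("Artificial agents learn 2 policies, quickly!"))     # e.g. ['artificial', 'agent', 'learn', 'policy', 'quickly']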

2.3 Deep neural network with Word2Vec

In this study, a deep neural network with Word2Vec approach is applied. The entire methodology has two phases: training a Word2Vec skip-gram model on the datasets to get the vocabulary and the text representation, and then using a deep neural network that utilizes the word embeddings learned in the previous step.

2.3.1 Word Embedding

Any neural network model needs a vector representation of a word. So, we need an embedding layer before building a deep learning model.

Word2Vec models, proposed by Tomas Mikolov at Google (Mikolov et al., 2013), are used for learning word embeddings. The advantage of Word2Vec is that the similarity and relationship between words can be derived from the learned vectors (Khatua et al., 2019). It can be obtained using two methods (both involve a neural network): Skip-gram and Continuous Bag of Words (CBOW).

As mentioned in (Mikolov et al., 2013), skip-gram works well with a small amount of data and does well at representing rare words. On the other hand, CBOW is faster and has better representations for more frequent words.

The TechDOfication dataset has some rare words like "streptococcus", "polymerization", "hessian", and "kyoto", as these words are related to specific technical domains. So, to represent these words well, we use Word2Vec here.

The training dataset consisting of p texts is denoted as

D = {T_1, T_2, T_3, ..., T_i, ..., T_p}

where T_i is the ith text and p is the total number of texts present in the training dataset, e.g., 23,959 in Subtask-1a and 13,580 in Subtask-2a. Given a text T_i with m words (i.e., the length of the text), it is denoted as

T_i = {w_{i,1}, w_{i,2}, w_{i,3}, ..., w_{i,k}, ..., w_{i,m}},

where w_{i,k} denotes the kth word in the ith text. The Word2Vec skip-gram model is trained using this training dataset. To train the word embeddings, we set the embedding dimension to 300 and the window to 10, and saved the trained Word2Vec model for the next step.
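Training the skip-gram embeddings with these parameters (dimension 300, window 10) can be done with gensim, as in this sketch; gensim itself and min_count=1 are our assumptions, since the paper does not name the library, and the parameter names follow gensim 4.x.

from gensim.models import Word2Vec

# tokenized_texts stands in for the preprocessed training texts (lists of tokens).
tokenized_texts = [["artificial", "intelligence", "agent"],
                   ["database", "normalization", "schema"]]

w2v = Word2Vec(sentences=tokenized_texts,
               vector_size=300,    # embedding dimension used in the paper
               window=10,          # context window used in the paper
               sg=1,               # skip-gram
               min_count=1)        # keep rare technical terms (assumption)
w2v.save("techdo_word2vec.model")
print(w2v.wv["database"].shape)    # (300,)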

We embed each word w_{i,k} using our pre-trained word vectors after loading the model into memory, i.e., each word in the text is converted into a d-dimension embedding vector, where w_{i,k}^v ∈ R^d is the d-dimension embedding vector of the kth word. The word-level embedding of a text is

T_i^v = {w_{i,1}^v, w_{i,2}^v, w_{i,3}^v, ..., w_{i,k}^v, ..., w_{i,m}^v}.

Figure 3 shows the Word2Vec architecture, where H = H_1, H_2, H_3, ..., H_n is a hidden layer.

2.3.2 Classification model

LSTM is an extension of the Recurrent Neural Network (RNN) (Hochreiter and Schmidhuber, 1997), capable of learning long dependencies. They were introduced by (Sulehria and Zhang, 2007).

In this section, the deep neural network BiLSTM is used for the classification. We give T_i^v as input to the BiLSTM for feature extraction, namely F_i^BiLSTM, in Equation (1):

F_i^BiLSTM = BiLSTM(T_i^v)    (1)

In a Bidirectional LSTM, sequence data is processed in both directions with a forward LSTM and a backward LSTM layer, and these two hidden layers are connected to the same output layer. The LSTM neural network contains three gates and a cell memory state. For a single LSTM cell, it can be computed as

X = [h_{t-1}; w_{i,k}^v]    (2)
f_t = σ(W_f · X + b_f)    (3)
i_t = σ(W_i · X + b_i)    (4)
o_t = σ(W_o · X + b_o)    (5)
c_t = f_t * c_{t-1} + i_t * tanh(W_c · X + b_c)    (6)
h_t = o_t * tanh(c_t)    (7)

Figure 4: The architecture of basic BiLSTM


where W_f, W_i, W_o are the weight matrices and b_f, b_i, b_o are the biases of the LSTM cell during training, and σ denotes the sigmoid function. w_{i,k}^v is the word embedding vector given as input to the LSTM, and h_t is the hidden vector, so h_m can denote a text. The simple BiLSTM architecture is shown in Figure 4. In the architecture, {w_{i,1}^v, w_{i,2}^v, w_{i,3}^v, ..., w_{i,k}^v, ..., w_{i,m}^v} denotes the word vectors, where m is the length of a text. {fh_1, fh_2, ..., fh_m} and {bh_1, bh_2, ..., bh_m} represent the forward and backward hidden vectors respectively, and {h_1, h_2, h_3, ..., h_t, ..., h_m} represents the final hidden layer. The final hidden vector h_t of the BiLSTM is given by Equation (8):

h_t = [fh_t, bh_t]    (8)

In the BiLSTM layer, 20% dropout is used. After feeding the input to the BiLSTM layer, a TimeDistributed wrapper is used along with a dense layer whose activation function is the rectified linear unit (ReLU); on top of these layers, a dense layer with a softmax activation function is applied after flattening the output generated from the previous layer.

3 Result and conclusion

Domain   Precision  Recall  F1-score
cse      0.26       0.33    0.29
che      0.17       0.21    0.19
physics  0.23       0.19    0.21
law      0.22       0.16    0.19
math     0.16       0.15    0.16

Table 3: Result of CITK (our team) on the Subtask-1a dataset (dev set)

Domain  Precision  Recall  F1-score
ca      0.29       0.23    0.26
se      0.16       0.17    0.16
algo    0.11       0.14    0.13
cn      0.14       0.14    0.14
pro     0.18       0.17    0.18
ai      0.18       0.20    0.19
dbms    0.23       0.20    0.22

Table 4: Result of CITK (our team) on the Subtask-2a dataset (dev set)

As it was a prediction task on the test dataset and we presently do not have the labeled test dataset, the dev dataset is used here to evaluate model performance. From Table 3 and Table 4, we can see that BiLSTM with Word2Vec did not produce good precision, recall or F1-score, and the accuracy of 22% in both cases is also very low. To evaluate model performance, the F1 score proposed by Buckland and Gey (Buckland and Gey, 1994) has been widely used in the literature.

Team Name  Accuracy  Precision  Recall  F1-score
ICON2020   0.8156    0.8155     0.8156  0.8143
CITK       0.2204    0.2264     0.2204  0.2204

Table 5: Comparison between the highest score and our score (CITK) on the Subtask-1a dataset (test set)

Team Name   Accuracy  Precision  Recall  F1-score
fineapples  0.8252    0.8265     0.8252  0.8244
CITK        0.2306    0.2344     0.2302  0.2307

Table 6: Comparison between the highest score and our score (CITK) on the Subtask-2a dataset (test set)

Table 5 and Table 6 show the results on the test dataset published by ICON 2020. Here, we only include the highest score along with our score to compare the performances. For the Subtask-1a dataset, the "ICON2020" team produced good accuracy, precision, recall and F1-score compared to the other teams, including our "CITK" team. The "Fineapples" team produced the best result for the Subtask-2a dataset.

After applying preprocessing on the texts of the Subtask-2a dataset, we get a maximum sentence length of 266, but the other texts are not that long except one. Here, we trained the Word2Vec model only on the given training dataset, which is very small for learning good embeddings. Word embedding training gives us 10,722 unique words in Subtask 1a and 7,397 in Subtask 2a. Quite obviously, dealing with the fine-grained dataset compared to the coarse-grained dataset is somewhat challenging, as the vocabulary size is very small.

Word embedding seems very important here. In most cases a pre-trained word embedding such as the Google News dataset³ (about 100 billion words) is used, which contains 300-dimensional vectors for 3 million words and phrases; the archive is available as GoogleNews-vectors-negative300.bin.gz. But in our case, some words like "streptococcus", "polymerization", "hessian", and "kyoto" are missing from the large Google News dataset. Those words are very important for identifying the specific domains, so they cannot be ignored. Combining the Google News dataset and the ICON 2020 shared task dataset can be a solution, which needs some experiments, such as concatenating them and removing possible correlations by performing Principal Component Analysis (PCA) (Basirat, 2018).

3 https://code.google.com/archive/p/word2vec/

References

M. S. Akhtar, A. Ekbal, and E. Cambria. 2020. How intense are you? Predicting intensities of emotions and sentiments using stacked ensemble [application notes]. IEEE Computational Intelligence Magazine, 15(1):64–75.

Ali Basirat. 2018. A generalized principal component analysis for word embedding.

Gabriel Bernier-Colborne, Caroline Barriere, and Pierre Andre Menard. 2017. Fine-grained domain classification of text using TERMIUM Plus.

Siddhartha Brahma. 2018. Improved sentence modeling using suffix bidirectional LSTM. arXiv: Learning.

Michael Buckland and Fredric Gey. 1994. The relationship between recall and precision. J. Am. Soc. Inf. Sci., 45(1):12–19.

Keith Cortis, Andre Freitas, Tobias Daudert, Manuela Hurlimann, Manel Zarrouk, Siegfried Handschuh, and Brian Davis. 2017. SemEval-2017 task 5: Fine-grained sentiment analysis on financial microblogs and news.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Aparup Khatua, Apalak Khatua, and Erik Cambria. 2019. A tale of two epidemics: Contextual word2vec for classifying twitter streams during outbreaks. Information Processing & Management, 56:247–257.

Roman Klinger, Orphee de Clercq, Saif Mohammad, and Alexandra Balahur. 2018. IEST: WASSA-2018 implicit emotions shared task.

Oren Melamud, Jacob Goldberger, and Ido Dagan. 2016. context2vec: Learning generic context embedding with bidirectional LSTM. pages 51–61.

Tomas Mikolov, Ilya Sutskever, Kai Chen, G.s Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26.

Saif Mohammad and Felipe Bravo-Marquez. 2017. WASSA-2017 shared task on emotion intensity. pages 34–49.

Mehmet Salur and Ilhan Aydin. 2020. A novel hybrid deep learning model for sentiment classification. IEEE Access, 8:58080–58093.

Annika Marie Schoene, Alexander P. Turner, and Nina Dethlefs. 2020. Bidirectional dilated LSTM with attention for fine-grained emotion classification in tweets. In AffCon@AAAI.

Richard Socher, A. Perelygin, J.Y. Wu, J. Chuang, C.D. Manning, A.Y. Ng, and C. Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. EMNLP, 1631:1631–1642.

Humayun Karim Sulehria and Ye Zhang. 2007. Hopfield neural networks: A survey. In Proceedings of the 6th Conference on 6th WSEAS Int. Conf. on Artificial Intelligence, Knowledge Engineering and Data Bases - Volume 6, AIKED'07, pages 125–130, Stevens Point, Wisconsin, USA. World Scientific and Engineering Academy and Society (WSEAS).

Ellen Voorhees. 2006. The TREC question answering track. Nat. Lang. Eng., 7:361–378.

Wiebke Wagner. 2010. Natural language processing with Python, analyzing text with the Natural Language Toolkit by Steven Bird, Ewan Klein, Edward Loper. Language Resources and Evaluation, 44:421–424.

Jun Xie, Bo Chen, Xinglong Gu, Fengmei Liang, and Xinying Xu. 2019. Self-attention-based BiLSTM model for short text fine-grained sentiment classification. IEEE Access, 7:1–1.

Fan Zhang. 2019. A hybrid structured deep neural network with word2vec for construction accident causes classification. International Journal of Construction Management, pages 1–21.


Automatic Technical Domain Identification

Hema Ala
LTRC, IIIT-Hyderabad, India
[email protected]

Dipti Misra Sharma
LTRC, IIIT-Hyderabad, India
[email protected]

Abstract

In this paper we present two machine learning algorithms, namely Stochastic Gradient Descent and Multi Layer Perceptron, to identify the technical domain of a given text, as such text provides information about the specific domain. We performed our experiments on coarse-grained technical domains like Computer Science, Physics, Law, etc. for English, Bengali, Gujarati, Hindi, Malayalam, Marathi, Tamil, and Telugu, and on fine-grained sub-domains of Computer Science like Operating System, Computer Network, Database, etc. for English only. Using TF-IDF as the feature extraction method, we show how both machine learning models perform on the mentioned languages.

1 Introduction

We can frame automatic domain identification of a given text as a text classification problem where one needs to assign predefined categories to given texts. Text classification is a classic topic for Natural Language Processing (NLP); the range of text classification research goes from designing the best features to choosing the best possible machine learning classifiers. Therefore we use Term Frequency & Inverse Document Frequency (TF-IDF) to represent our text as vectors, so that the machine learning algorithms can find the relationships between them and classify the given text. Many machine learning algorithms have shown strong performance on text classification, but a very limited number of studies have explored technical domains like computer science, chemistry, management, etc., and fewer still on Indian languages (Kaur and Saini, 2015). There are numerous applications of text classification in Natural Language Processing tasks like Machine Translation, etc. For these tasks, technical domain identification would be the first process. It determines the domain for a given input text; subsequently, Machine Translation can choose its resources as per the identified domain. The task can also be viewed at the coarse-grained or fine-grained level based on the requirement. We did our experiments on data provided by the ICON TechDOfication-2020 shared task, for the English, Bengali, Gujarati, Hindi, Malayalam, Marathi, Tamil and Telugu languages for coarse-grained domain classification. For fine-grained classification, we have the Computer Science domain in English.

2 Related Work

We treat automatic technical domain identification as a text classification task where we assign predefined categories like chem for chemistry, cs for computer science to the given text. Text classification is a fundamental task in NLP applications and a crucial technology in many applications, such as web search, ads matching, and sentiment analysis. Many researchers have found a variety of algorithms to solve the text classification problem.

The algorithms will vary based on the language of the text and the domain of the text as well. McCallum et al. (1998) compared the theory and practice of two different first-order probabilistic classifiers, both of which make the naive Bayes assumption. The multinomial model is found to be almost uniformly better than the multi-variate Bernoulli model. Joachims (1999) introduced the Transductive Support Vector Machine for text classification. While general Support Vector Machines (SVMs) try to produce a general decision function for a learning task, Transductive Support Vector Machines take a particular test set into account and try to minimize misclassification of just those particular samples. Nigam et al. (1999) used maximum entropy for text classification by computing the conditional distribution of the class given the text, and compared accuracy to naive Bayes, showing that maximum entropy is sometimes significantly better, but also sometimes worse. Lodhi et al. (2002) proposed a novel approach for categorizing text documents based on the use of a special kernel called the string subsequence kernel. Machine learning for text classification is the foundation of document categorization, news filtering, document routing, and personalization.

In text domains, effective feature selection is crucial to make the learning task efficient and more accurate. Based on this point, Forman (2003) presented an extensive comparative study of twelve feature selection metrics, such as Document Frequency, for the high-dimensional domain of text classification, focusing on support vector machines and 2-class problems, typically with high class skew. In social media such as Twitter and Facebook, users may become overwhelmed by the raw data. One solution to this problem is the classification of short text; Sriram et al. (2010) proposed an approach that uses a small set of domain-specific features extracted from the author's profile and text to classify the text into a predefined set of generic classes such as News, Events, Opinions, Deals, and Private Messages. Apart from machine learning algorithms, there are also deep learning techniques for text classification. In contrast to traditional methods, Lai et al. (2015) introduced a recurrent convolutional neural network for text classification without human-designed features. In their model, they apply a recurrent structure to capture contextual information as far as possible when learning word representations, which may introduce considerably less noise compared to traditional window-based neural networks.

Conneau et al. (2016) presented a new architecture (VD-CNN) for text processing which operates directly at the character level and uses only small convolutions and pooling operations. Joulin et al. (2016) used fastText word features, averaged to get a sentence representation, for text classification. Yao et al. (2019) proposed a novel approach for text classification termed Graph Convolutional Networks (Text-GCN); it can capture global co-occurrence information and uses limited labelled texts/documents well. Though there exists a lot of work on text classification, very few works address technical domains and Indian languages like ours. Therefore we present our approach on the provided Indian languages along with technical domains.

3 Approach

We evaluate our two models, namely Stochastic Gradient Descent and Multi Layer Perceptron, on technical domains (Chemistry, Communication Technology, Computer Science, Law, Math, Physics, Bio-Chemistry, Management) for coarse-grained technical domain classification for all the above-mentioned languages (though the number of domains may differ from language to language). For fine-grained technical domain classification, we have only Computer Science, in which the sub-domains include AI, Algorithm, Computer Architecture, Computer Networks, Database Management System, Programming and Software Engineering, for English. We used TF-IDF for all experiments.

3.1 Term Frequency & Inverse Document Frequency (TF-IDF)

We use TF-IDF as the feature extraction method in our experiments. The most basic form of weighted word feature extraction is term frequency (TF) (Salton and Buckley, 1988), where each word is mapped to a number corresponding to the number of occurrences of that word in the whole corpus. Methods that extend the results of TF generally use word frequency as a boolean or logarithmically scaled weighting.

W(d, t) = TF(d, t) · log(N / df(t))    (1)

Jones (1972) proposed Inverse Document Frequency (IDF) as a method to be used along with term frequency in order to lessen the effect of implicitly common words in the corpus. IDF assigns a higher weight to words with either high or low term frequency in the document. This combination of TF and IDF is well known as Term Frequency-Inverse Document Frequency (TF-IDF). The mathematical representation of the weight of a term in a document by TF-IDF is given in Equation 1. Here N is the number of documents and df(t) is the number of documents containing the term t in the corpus. The first term in Equation 1 improves the recall while the second term improves the precision.
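As a small worked example of Equation 1 with made-up numbers: if a term occurs 3 times in a document, the corpus contains N = 1,000 documents, and the term appears in df(t) = 10 of them, then W(d, t) = 3 * log(1000/10) = 3 * log(100), i.e., 6 with base-10 logarithms (or about 13.8 with natural logarithms).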

3.2 Stochastic Gradient Descent (SGD)

We used the SGD classifier from scikit-learn (Pedregosa et al., 2011). SGD has been successfully applied to the large-scale and sparse machine learning problems often encountered in text classification and natural language processing. Though SGD is an optimizer, it can by itself be used as a classifier for text classification using different loss functions. The SGDClassifier class implements a plain stochastic gradient descent learning routine which supports different loss functions and penalties for classification. We performed our experiments with the hinge loss, which is equivalent to a linear Support Vector Machine (SVM).

3.3 Multi Layer Perceptron

The Multi Layer Perceptron classifier, as the name itself suggests, is based on a neural network. Unlike other classification algorithms such as Support Vector Machines or the Naive Bayes classifier, MLPClassifier relies on an underlying neural network to perform the task of text classification. MLPClassifier trains iteratively: at each time step, the partial derivatives of the loss function with respect to the model parameters are computed to update the parameters. It can also have a regularization term added to the loss function that shrinks model parameters to prevent overfitting. In our experiments we used the MLPClassifier from (Pedregosa et al., 2011).

4 Experiments & Results

We evaluate two machine learning algorithms on the data provided by the ICON TechDOfication-2020 shared task. The data statistics in terms of the number of sentences for all languages are given in Table 1. We are provided with various technical domains like physics, chemistry, etc. by the ICON TechDOfication 2020 shared task for the mentioned languages; however, the domains in each language are different. We have Physics (phy), Maths (math), Chemistry (che), Law (law) and Computer Science (cse) in English. Similarly, Bengali and Gujarati have Bio-Chemistry (bioche), cse, Communication Technology (com-tech), Management (mgmt) and phy. Hindi and Telugu have bioche, cse, phy, mgmt, com-tech and other, where Hindi has an extra math domain. Malayalam has the cse, bioche, and com-tech domains, and for Marathi we have bioche, com-tech, phy and cse. In fine-grained domain identification, i.e., identifying the sub-domain of Computer Science, we have the AI (ai), Algorithm (algo), Computer Architecture (ca), Computer Networks (cn), Database Management System (dbms), Programming (pro) and Software Engineering (se) sub-domains.

As mentioned in Section 3.1, we used TF-IDF for all experiments in this paper. For the SGD classifier we used the hinge loss, and we took alpha as 0.00001; it is a constant that multiplies the regularization term, and the higher the value, the stronger the regularization. The maximum number of iterations taken for this algorithm is 15. In MLPClassifier we used the relu activation function and the sgd solver, which is used to find the gradients and optimize the loss function. We adopted the same alpha as the SGD classifier. As MLP is a neural-network-based classifier, there is a need to give hidden layer sizes; we used [100, 90] for two hidden layers apart from the input and output layers.
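With the settings listed above, the two classifiers can be instantiated in scikit-learn roughly as follows; this is an illustrative sketch, the TF-IDF vectorizer options are left at their defaults, and the two in-line sentences are placeholders for the shared-task data.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.neural_network import MLPClassifier

train_texts = ["the compiler translates source code", "acids react with bases"]  # placeholders
train_labels = ["cse", "che"]

X_train = TfidfVectorizer().fit_transform(train_texts)

# Linear SVM via SGD: hinge loss, alpha = 1e-5, at most 15 iterations (as reported).
sgd_clf = SGDClassifier(loss="hinge", alpha=0.00001, max_iter=15)
sgd_clf.fit(X_train, train_labels)

# MLP: ReLU activation, SGD solver, same alpha, two hidden layers of 100 and 90 units.
mlp_clf = MLPClassifier(activation="relu", solver="sgd", alpha=0.00001,
                        hidden_layer_sizes=(100, 90))
mlp_clf.fit(X_train, train_labels)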

Lang.        Train   Dev    Test
English      23962   4850   2500
Bengali      58500   5843   1923
Gujarati     36009   5724   2683
Hindi        148445  14338  4212
Malayalam    40669   3390   1515
Marathi      41997   3780   1789
Tamil        72483   6190   2071
Telugu       68865   5920   2612
English(CS)  13580   1360   1930

Table 1: Data statistics (no. of sentences). English(CS) is the fine-grained classification task for the Computer Science domain in English.

Lang.        Acc.  P     R     F1
English      0.76  0.76  0.76  0.76
Bengali      0.66  0.71  0.68  0.66
Gujarati     0.58  0.57  0.58  0.57
Hindi        0.43  0.44  0.41  0.40
Malayalam    0.44  0.47  0.40  0.37
Marathi      0.48  0.50  0.47  0.43
Tamil        0.44  0.43  0.45  0.36
Telugu       0.55  0.60  0.56  0.57
English(CS)  0.70  0.70  0.70  0.70

Table 2: Classification report for the SGD classifier. English(CS) is the fine-grained classification task for the Computer Science domain in English. Acc: Accuracy, P: Precision, R: Recall, F1: F1-score.

We present accuracy, precision, recall and F1-score for all the tasks (for all mentioned languages) in Table 2 and Table 3 for the SGD classifier and the MLP classifier respectively. If we observe the results, both models performed well on English compared to the other languages. Motivated by this, our future work will be to improve the accuracy on Indian languages. The MLP classifier outperformed SGD in almost all tasks.


Lang.        Acc.  P     R     F1
English      0.77  0.77  0.77  0.77
Bengali      0.66  0.70  0.68  0.66
Gujarati     0.60  0.59  0.60  0.58
Hindi        0.43  0.50  0.42  0.43
Malayalam    0.44  0.47  0.38  0.36
Marathi      0.50  0.50  0.48  0.44
Tamil        0.45  0.44  0.50  0.39
Telugu       0.54  0.60  0.51  0.52
English(CS)  0.62  0.64  0.62  0.63

Table 3: Classification report for the MLP classifier. English(CS) is the fine-grained classification task for the Computer Science domain in English. Acc: Accuracy, P: Precision, R: Recall, F1: F1-score.

For fine-grained technical domain identification, SGD outperformed the MLP classifier. Comparatively, Malayalam and Tamil got lower scores with both algorithms. From all the experiments, we can conclude that we can use the MLP classifier for Technical Domain Identification, but there is still a strong need to improve on, or come up with, new algorithms for morphologically rich Indian languages.

5 Conclusion & Future Work

We are in the process of exploring many different algorithms for Technical Domain Identification. In the future we want to work on other possible languages for possible technical domains. In this paper we showed two machine learning algorithms (SGD and MLP). TF-IDF does not depend on any language- or domain-specific resources; hence, we preferred TF-IDF as the feature extraction method for both ML algorithms presented in the experiments. From the results we can conclude that the Multi Layer Perceptron performs better on these technical domains for the provided languages.

References

Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2016. Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781.

George Forman. 2003. An extensive empirical study of feature selection metrics for text classification. Journal of Machine Learning Research, 3(Mar):1289–1305.

Thorsten Joachims. 1999. Transductive inference for text classification using support vector machines. In ICML, volume 99, pages 200–209.

Karen Sparck Jones. 1972. A statistical interpretation of term specificity and its application in retrieval. Journal of Documentation.

Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.

Jasleen Kaur and Jatinderkumar R Saini. 2015. A study of text classification natural language processing algorithms for Indian languages. The VNSGU Journal of Science Technology, 4(1):162–167.

Siwei Lai, Liheng Xu, Kang Liu, and Jun Zhao. 2015. Recurrent convolutional neural networks for text classification. In Twenty-ninth AAAI Conference on Artificial Intelligence.

Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, and Chris Watkins. 2002. Text classification using string kernels. Journal of Machine Learning Research, 2(Feb):419–444.

Andrew McCallum, Kamal Nigam, et al. 1998. A comparison of event models for naive Bayes text classification. In AAAI-98 Workshop on Learning for Text Categorization, volume 752, pages 41–48. Citeseer.

Kamal Nigam, John Lafferty, and Andrew McCallum. 1999. Using maximum entropy for text classification. In IJCAI-99 Workshop on Machine Learning for Information Filtering, volume 1, pages 61–67. Stockholm, Sweden.

F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management, 24(5):513–523.

Bharath Sriram, Dave Fuhry, Engin Demir, Hakan Ferhatosmanoglu, and Murat Demirbas. 2010. Short text classification in Twitter to improve information filtering. In Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 841–842.

Liang Yao, Chengsheng Mao, and Yuan Luo. 2019. Graph convolutional networks for text classification. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7370–7377.


Fine-grained domain classification using Transformers

Akshat Gahoi, Akshat Chhajer, Dipti Misra Sharma

Language Technologies Research Center
International Institute of Information Technology, Hyderabad, India
{akshat.gahoi, akshat.chhajer}@research.iiit.ac.in, [email protected]

Abstract

The introduction of transformers in 2017 and subsequently BERT in 2018 brought about a revolution in the field of natural language processing. Such models are pretrained on vast amounts of data and are easily extensible to a wide variety of tasks through transfer learning. Continual work on transformer-based architectures has led to a variety of new models with state-of-the-art results. RoBERTa (Liu et al., 2019) is one such model, which brings a series of changes to the BERT (Devlin et al., 2018) architecture and is capable of producing better quality embeddings at the expense of some functionality. In this paper, we attempt to solve the well-known text classification task of fine-grained domain classification using BERT and RoBERTa and perform a comparative analysis of the two. We also attempt to evaluate the impact of data preprocessing, especially in the context of fine-grained domain classification.

The results obtained outperformed all the other models at the ICON TechDOfication 2020 (subtask-2a) fine-grained domain classification task and ranked first. This proves the effectiveness of our approach.

1 Introduction

Transformer-based language models have been showing promising progress on a number of different natural language processing (NLP) benchmarks. The combination of transfer learning methods with large-scale transformer language models is becoming a standard in modern NLP and has resulted in many state-of-the-art models.

Compared to LSTMs (Greff et al., 2015), the main limitation of bidirectional LSTMs is their sequential nature, which makes training in parallel very difficult. The transformer architecture solves that by completely replacing LSTMs with the so-called attention mechanism (Vaswani et al., 2017). With attention, we see an entire sequence as a whole, and therefore it is much easier to train in parallel.

Text classification is a well-known task in Natural Language Processing, which aims at automatically providing additional document-level metadata (e.g. domain, genre, author).

One of the text classification tasks, domain classification, can be divided into two categories:

• Coarse-grained domain classification

• Fine-grained domain classification

Coarse-grained domain classification aims to classify the input into varied and unrelated domains such as chemistry, law, computer science, etc. On the other hand, fine-grained domain classification aims to classify the input into closely related sub-domains under a higher-level domain. An example is the classification of text on physics into different topics like relativity, mechanics, or quantum mechanics. The latter is found to be a significantly more challenging task due to the similarity and lack of distinction between the inputs, attributed to the fact that they are under the umbrella of a common domain.

Although there is substantive work on domain classification in general (Young-Bum Kim, 2018), there has been less emphasis on fine-grained domain classification and the various augmentations to data that can be done to achieve higher performance on that task. This paper looks at the task of fine-grained domain classification in the context of transformers. It provides a comparison between the widely used BERT and RoBERTa embeddings for the task and attempts to observe the impact of data preprocessing in the context of fine-grained domain classification.

On a blind test corpus of 1929 text samples, the model proposed in this paper achieved an F1 score of 0.824 at the ICON TechDOfication 2020 shared task (subtask-2a). This result helped us reach the leading position on the leaderboard.

2 Dataset

For the study, the dataset from the ICON TechDOfication 2020 (subtask-2a) was used. The entire collection consisted of 14910 text samples in English spanning 7 sub-domains of Computer Science. Table 1 shows the sub-domains and the distribution of the data.

Sub-domain                  Code  Samples
Artificial Intelligence     ai    2140
Algorithm                   algo  2131
Computer Architecture       ca    2127
Computer Networks           cn    2140
Database Management System  dbms  2140
Programming                 pro   2122
Software Engineering        se    2140

Table 1: Dataset distribution across domains

The average number of characters in the text samples was 177.6, and the average number of tokens was 36.3.

A collection of 1929 text samples served as the blind test set for this task.

3 BERT vs RoBERTa

BERT is a bi-directional transformer pre-trained over a huge amount of unlabeled textual data to learn a language representation. It can then be fine-tuned for specific machine learning tasks like text classification. BERT outperformed the NLP state of the art on several challenging tasks, attributed to the bidirectional transformer and the novel pre-training tasks of Masked Language Modeling (Song et al., 2019) and Next Sentence Prediction (Shi and Demberg, 2019).

RoBERTa has a very similar architecture to BERT, with an improved training methodology and more data. To improve the training, RoBERTa removes the Next Sentence Prediction task from BERT's pre-training and introduces dynamic masking so that the masked token changes during the training epochs. Originally, BERT is trained for 1M steps with a batch size of 256 sequences. RoBERTa, on the other hand, is trained for 125K steps with 2K sequences and 31K steps with a batch size of 8K sequences. Large batches are also easier to parallelize via distributed parallel training.

Figure 1: Confusion matrices for the dev set without preprocessing: (a) BERT with no preprocessing, (b) RoBERTa with no preprocessing

4 System Overview

This section presents an overview of the system which was used to produce the scores described in the Results section of the paper.

4.1 Pre-Processing

In the first approach, only one-hot encoding of the labels was done, and the raw text was fed as-is to both transformers.

In the second approach, the raw data was preprocessed keeping in mind the nature of the fine-grained domain classification task. First, tokenization was done on the text using spaCy and the stop words were filtered out. Next, the tokens were passed through a counter, and the top 20 tokens from the entire corpus were identified and then removed. As domain classification relies more on keywords than on sentence structure, the data was cleaned in this way. Lastly, the text was reconstructed from the remaining tokens. This was done to reduce the generalization amongst the sub-domains, as the text had a lot of common terms from the higher-level computer science domain itself.
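The second preprocessing approach can be sketched with spaCy as below; the en_core_web_sm model and the whitespace re-joining are our own choices, while the stop-word filtering and removal of the 20 most frequent remaining tokens follow the description above.

from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm", disable=["parser", "ner"])   # tokenizer and lexical attributes suffice

def tokenize(text):
    return [tok.text.lower() for tok in nlp(text) if not tok.is_stop and not tok.is_punct]

def preprocess_corpus(texts, top_k=20):
    tokenized = [tokenize(t) for t in texts]
    counts = Counter(tok for doc in tokenized for tok in doc)
    frequent = {w for w, _ in counts.most_common(top_k)}         # shared high-level CS vocabulary to drop
    return [" ".join(tok for tok in doc if tok not in frequent) for doc in tokenized]

cleaned = preprocess_corpus(["The database index speeds up queries.",
                             "A neural network learns its weights."])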

4.2 Training

In total, 4 models were trained using BERT/RoBERTa, with and without pre-processing.


Figure 2: Confusion matrices for the dev set with preprocessing: (a) BERT with preprocessing, (b) RoBERTa with preprocessing

The training was done using the concept of transfer learning. The pretrained bert-base-uncased and roberta-base models were taken and further fine-tuned on the training dataset. The learning rate used was 4e-5 with a batch size of 128. Each of the models was trained for 4 epochs.

5 Results and Evaluation

All the models were evaluated on the dev dataset and the results are presented in Table 2. It is clear that pre-processing indeed increases the F1 score and makes a substantive difference in fine-grained domain classification. This is because the common terms from the higher-level common domain are removed and more distinction is created between the text samples of the sub-domains.

In Figure 1 (results on the dev set without preprocessing), we can see that RoBERTa misclassifies only slightly fewer text samples than BERT, with both performing very similarly. However, there is a difference when preprocessing is done and frequent words are removed. In Figure 2 (results on the dev set with preprocessing), we can see that BERT misclassifies 91 text samples as 'Database Management System', which reduces to 48 when using RoBERTa. Similarly, 87 misclassifications made by BERT as 'Programming' are reduced to 54 by RoBERTa. However, it is seen that RoBERTa tends to misclassify text samples as 'Software Engineering' often.

In both cases, RoBERTa performed better than BERT, although with a small margin. The best performing model (RoBERTa with preprocessing) was then evaluated using the ICON TechDOfication 2020 (subtask-2a) test dataset. The results obtained are shown in Table 3. It was observed that, across different models, documents get misidentified between a common pair of domains, hence indicating a close relation between the two domains. So this experiment can also be used to determine two closely related domains among a large variety of domains.

Transformer        Pre-processed  Precision  Recall  F1
bert-base-uncased  no             0.761      0.752   0.756
roberta-base       no             0.788      0.783   0.781
bert-base-uncased  yes            0.832      0.812   0.810
roberta-base       yes            0.842      0.837   0.835

Table 2: Results for the dev set

Transformer  Accuracy  Precision  Recall  F1
RoBERTa      0.825     0.826      0.825   0.824

Table 3: Final model results on the test dataset

6 Conclusion

In this paper, we did a comparative analysis ofBERT and RoBERTa in the context of fine-graineddomain classification. Furthermore, the impact ofpre-processing was also explored. It was found thatpre-processing and removal of common terms fromdata helps the model perform better as more dis-tinction is created between the sub-domains. Theresults indicate that RoBERTa performs slightlybetter than BERT in all the cases.

The model proposed in this paper ranked first in the ICON TechDOfication 2020 (subtask-2a) with an F1 score of 0.824.

7 Future Work

This paper shows how well transformers can perform for the task of multi-class text classification. The main difference in the results comes from the embeddings being used. Thus, a very high-performing multilingual model can be created if enough data and pre-trained language models are available for Indian languages. Hence, the goal can be to create BERT embeddings for other languages and extend the work done by AI4Bharat/IndicBERT (Kakwani et al., 2020). This can then be used not only for text classification tasks but also for any other multilingual state-of-the-art natural language processing task.

References

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. CoRR, abs/1810.04805.

Klaus Greff, Rupesh Kumar Srivastava, Jan Koutník, Bas R. Steunebrink, and Jürgen Schmidhuber. 2015. LSTM: A search space odyssey. CoRR, abs/1503.04069.

Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, Gokul N.C., Avik Bhattacharyya, Mitesh M. Khapra, and Pratyush Kumar. 2020. IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4948–4961, Online. Association for Computational Linguistics.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A robustly optimized BERT pretraining approach. CoRR, abs/1907.11692.

Wei Shi and Vera Demberg. 2019. Next sentence prediction helps implicit discourse relation classification within and across domains. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5790–5796, Hong Kong, China. Association for Computational Linguistics.

Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. CoRR, abs/1905.02450.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.

Young-Bum Kim, Dongchan Kim, Anjishnu Kumar, and Ruhi Sarikaya. 2018. Efficient large-scale neural domain classification with personalized attention.


TechTexC: Classification of Technical Texts using Convolution and Bidirectional Long Short Term Memory Network

Omar Sharif†, Eftekhar Hossain*, and Mohammed Moshiul Hoque†

†Department of Computer Science and Engineering
*Department of Electronics and Telecommunication Engineering

Chittagong University of Engineering and Technology, Bangladesh
†{omar.sharif, moshiul 240}@cuet.ac.bd

*[email protected]

Abstract

This paper illustrates the detailed description of a technical text classification system and its results, developed as a part of participation in the shared task TechDOfication 2020. The shared task consists of two sub-tasks: (i) the first task is to identify the coarse-grained technical domain of a given text in a specified language, and (ii) the second task is to classify a text of the computer science domain into fine-grained sub-domains. A classification system (called 'TechTexC') is developed to perform the classification task using three techniques: convolution neural network (CNN), bidirectional long short term memory (BiLSTM) network, and combined CNN with BiLSTM. Results show that the CNN with BiLSTM model outperforms the other techniques for task-1 sub-tasks (a, b, c and g) and task-2a. This combined model obtained f1 scores of 82.63 (sub-task a), 81.95 (sub-task b), 82.39 (sub-task c), 84.37 (sub-task g), and 67.44 (task-2a) on the development dataset. Moreover, in the case of the test set, the combined CNN with BiLSTM approach achieved the highest accuracy for the subtasks 1a (70.76%), 1b (79.97%), 1c (65.45%), 1g (49.23%) and 2a (70.14%).

1 Introduction

Due to the substantial growth and effortless access to the Internet in recent years, an enormous amount of unstructured textual content has been generated. It is a crucial task to organize or structure such voluminous unstructured text manually. Thus, automatic classification can be useful to manipulate a huge amount of text and extract meaningful insights, which saves a lot of time and money. Text categorization is a classical NLP problem which aims to categorize texts into organized groups. It has a wide range of applications like machine translation, question answering, summarization, and sentiment analysis. There are several approaches available to classify texts according to their labels. However, deep learning methods outperform rule-based and machine learning-based models because of their ability to capture sequential and semantic information from texts (Minaee et al., 2020). We propose a classifier using CNN (Jacovi et al., 2018) and BiLSTM (Zhou et al., 2016) to classify technical texts in the computer science domain. Furthermore, by sequentially adding these networks, remarkable accuracy in several shared classification tasks can be obtained. The rest of the paper is organized as follows: related work is given in Section 2. Section 3 describes the dataset. The framework is described in Section 4. The findings are presented in Section 5.

2 Related Work

CNN and LSTM have achieved great success in various NLP tasks such as sentence classification, document categorization, sentiment analysis, and summarization. Kim (2014) used a convolution neural network to classify sentences. Another method used contents and citations to classify scientific documents (Cao and Gao). Zhou et al. (2016) used 2-D max pooling and bidirectional LSTM to classify texts. Zhou et al. (2015) combined CNN and LSTM to classify sentiment and question type. Their system achieved superior accuracy compared to CNN and LSTM individually. Hossain et al. (2020) used LSTM to classify the sentiment of Bengali text documents. Their system got maximum accuracy with one layer of LSTM followed by three dense layers. Ranjan et al. (2017) proposed a document classification framework using LSTM and feature selection algorithms. Ameur et al. (2020) combined CNN and RNN methods to categorize Arabic texts. They used dynamic, fine-tuned word embeddings to get effective results on an open-source Arabic dataset.

3 Dataset

To develop the classifier model, we used the dataset provided by the organizers of the shared task1. This shared task consists of two subtasks: subtask-1 and subtask-2. Subtask-1 aims at the identification of the coarse-grained domain for a piece of text. The organizers provided data covering eight different languages (English, Bangla, Hindi, Gujarati, Malayalam, Marathi, Tamil, Telugu), each having a different number of classes for this task. In subtask-2, the goal is to find the fine-grained sub-domain of a text from the computer science domain. Seven classes, namely artificial intelligence, algorithm, computer architecture, computer networks, database management systems, programming, and software engineering, are available in subtask-2. The number of training, validation and test texts for each task is different. A summary of the dataset is presented in Table 1.

Task     No. of classes  Train   Dev    Test
task-1a  5               23962   4850   2500
task-1b  5               58500   5842   1923
task-1c  5               36009   5724   2682
task-1d  7               148445  14338  4211
task-1e  3               40669   3390   1514
task-1f  4               41997   3780   1788
task-1g  6               72483   6190   2070
task-1h  6               68865   5920   2611
task-2a  7               13580   1360   1929

Table 1: Dataset description

4 System Overview

Figure 1 shows the schematic diagram of the proposed system. The system has four major parts: preprocessing, feature extraction, classifier model and prediction. After processing the raw texts, the Word2Vec word embedding technique is applied on the processed texts to extract features. After exploiting the inherent features of the texts, the model is trained with CNN, BiLSTM and a combination of CNN & BiLSTM. Finally, the trained model is used to predict the class on the development set.

1https://ssmt.iiit.ac.in/techdofication.html

4.1 Preprocessing

In this step, all punctuation (,.;:”!) and flawed characters (#, $, %, *, @) are removed from the input texts. Texts having a length of fewer than two words are also discarded. Deep learning algorithms cannot learn directly from the raw texts. Thus, a numeric mapping of the input texts is created. A vocabulary of K unique words is developed and each input text is encoded into a numeric sequence based on the word index in the vocabulary. By applying the pad sequence method, each sequence is converted into a fixed-length vector. We choose an optimal sequence length of 100, as most of the texts range between 30-70 words in length. In order to maintain a fixed length of inputs, zero padding is used for short texts, and extra values are discarded from the long sequences.
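A minimal sketch of this encoding and padding step with the Keras utilities is shown below; the padding/truncation side ("post") and the placeholder texts are assumptions, as the paper only fixes the sequence length of 100.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

texts = ["sample technical sentence one", "another longer technical sentence"]  # placeholder corpus

# Build the vocabulary of K unique words and map each text to a sequence of word indices.
tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

# Pad/truncate every sequence to the fixed length of 100 chosen above:
# short texts are zero-padded and longer ones are cut.
X = pad_sequences(sequences, maxlen=100, padding="post", truncating="post")
vocab_size = len(tokenizer.word_index) + 1   # +1 for the padding index
```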

4.2 Feature Extraction

To extract features from texts and capture the semantic properties of words, the Word2Vec (Mikolov et al., 2013) embedding technique is used. Embedding maps textual data into a dense vector, solving the sparsity problem. We use the default embedding layer of Keras to produce the embedding matrix. The embedding layer has three parameters: vocabulary size, embedding dimension and length of texts. The embedding dimension determines the size of the dense word vector. The entire corpus is fitted into the embedding layer for a specific subtask, and 100 is chosen as the embedding dimension for all the subtasks. Features extracted from the embedding layer are propagated to the rest of the network.

4.3 Classifier Model

In this work, CNN and BiLSTM are used for initial model building. However, after combining these methods, we get superior results in several subtasks (Zhou et al., 2015). A description of the proposed architecture is given in the subsequent paragraphs.

CNN: In CNN, convolution filters capture the inherent syntactic and semantic features of the texts. The proposed classifier uses two one-dimensional CNN layers. In each layer, there are 128 filters with kernel size 5. To downsample the features of the CNN, a max-pooling technique is utilized where the pool size is 1×5. We have used the non-linear activation function 'relu' with the CNN.


Figure 1: Schematic diagram of our system

BiLSTM: We use a Bidirectional LSTM network to capture the sequential features from the input text and to avoid the vanishing/exploding gradient problems of simple RNNs. We use two layers of BiLSTM on top of each other, where each layer has 128 LSTM cells. In order to reduce overfitting on the training data, the dropout technique is used with a dropout rate of 0.2. After obtaining the hidden representation, the LSTM layer output is passed to the softmax layer for classification.

CNN+BiLSTM: In this approach, we merge the CNN and BiLSTM models with a marginal modification in the network architecture. Previously, we used two layers of CNN and BiLSTM, whereas in this technique we discard one layer from each network and combine them sequentially. Word embedding features are fed to the CNN, which has 128 filters. After max pooling with a window of size 5, the features of the CNN are propagated to the LSTM layer. It has 128 bidirectional cells to capture the sequential information. In order to mitigate overfitting, a dropout layer is added with a dropout rate of 0.2. Finally, the softmax layer gets input from the LSTM and performs classification.
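Under the hyperparameters listed in Table 2, the combined model could be sketched in Keras roughly as follows; the loss function and the exact position of the dropout layer are assumptions, since the paper does not spell them out.

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (Embedding, Conv1D, MaxPooling1D,
                                     Bidirectional, LSTM, Dropout, Dense)

def build_cnn_bilstm(vocab_size, num_classes, seq_len=100, embed_dim=100):
    # One Conv1D block followed by one BiLSTM layer, as described above.
    model = Sequential([
        Embedding(vocab_size, embed_dim, input_length=seq_len),
        Conv1D(128, kernel_size=5, activation="relu"),   # 128 filters, kernel size 5
        MaxPooling1D(pool_size=5),                       # max pooling with window 5
        Bidirectional(LSTM(128)),                        # 128 bidirectional LSTM cells
        Dropout(0.2),                                    # dropout rate 0.2
        Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",                      # Adam, default learning rate 0.001
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn_bilstm(vocab_size=20000, num_classes=7)   # e.g., 7 classes for task-2a
```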

4.4 Prediction

The goal of the prediction module is to determine the technical domain of an input text that the model has never seen before. For the prediction, sample instances are processed and converted into numerical sequences by the tokenizer. The trained model uses this sequence to predict the associated class of the input text.

5 Experiments

The Google Colaboratory platform is used to conduct the experiments. The deep learning models are developed with the Keras=2.4.0 framework with tensorflow=2.3.0 as the backend. For data preparation and evaluation, we use python=3.6.9 and scikit-learn=0.22.2.

5.1 Hyperparameter Settings

The performance of deep learning models heavily depends on the hyperparameters used in training. To choose the optimal hyperparameters for the proposed model, we experimented with different combinations. We chose parameter values based on their effect on the output. Table 2 exhibits the values of the different hyperparameters considered to train the proposed model.

Hyperparameters       Optimum value
Embedding dimension   100
Padding length        100
Filters               128
Kernel size           5
Pooling type          max
Window size           5
LSTM cell             128
Dropout rate          0.2
Optimizer             'adam'
Learning rate         0.001
Batch size            128

Table 2: Hyperparameter Settings

The Adam optimizer is used with a learning rate of 0.001. The model is trained with a batch size of 128 until a training accuracy of 98% is reached. We use Keras callbacks to save the intermediate model with the best validation accuracy during training. The trained model is then used to predict on the instances of the development set.
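A minimal sketch of this checkpointing setup is given below; it assumes the model and padded data from the earlier sketches, and the epoch count is a placeholder, since training is actually stopped once roughly 98% training accuracy is reached.

```python
from tensorflow.keras.callbacks import ModelCheckpoint

# Save the model whenever validation accuracy improves, as described above.
checkpoint = ModelCheckpoint("best_model.h5", monitor="val_accuracy",
                             save_best_only=True, mode="max")

# X_train/y_train and X_dev/y_dev are assumed to be the padded sequences and integer
# labels produced by the earlier preprocessing sketch; epochs=30 is only a placeholder.
model.fit(X_train, y_train,
          validation_data=(X_dev, y_dev),
          batch_size=128, epochs=30,
          callbacks=[checkpoint])
```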

5.2 Results

We determine the superiority of the models based on their weighted f1 score on the development set of the different tasks. Table 3 shows the evaluation results of CNN, BiLSTM and CNN+BiLSTM.


Task                 CNN (P / R / F)         Bi-LSTM (P / R / F)     CNN+Bi-LSTM (P / R / F)
task-1a (English)    81.48 / 81.36 / 81.4    82.81 / 82.52 / 82.52   82.9 / 82.54 / 82.63
task-1b (Bangla)     81.96 / 81.94 / 81.91   81.49 / 81.38 / 81.39   82.04 / 81.97 / 81.95
task-1c (Gujarati)   82.63 / 82.39 / 82.38   82.79 / 82.05 / 82.06   82.58 / 82.41 / 82.39
task-1d (Hindi)      79.46 / 79.0 / 79.07    79.86 / 79.54 / 79.52   79.65 / 79.44 / 79.48
task-1e (Malayalam)  91.51 / 91.53 / 91.52   91.87 / 91.89 / 91.86   91.38 / 91.36 / 91.32
task-1f (Marathi)    85.93 / 85.93 / 85.84   86.47 / 86.53 / 86.48   86.57 / 86.51 / 86.38
task-1g (Tamil)      84.26 / 84.07 / 84.02   84.36 / 84.34 / 84.3    84.56 / 84.63 / 84.37
task-1h (Telugu)     86.85 / 86.82 / 86.78   87.64 / 87.34 / 87.41   87.17 / 87.14 / 87.13
task-2a (English)    64.36 / 63.82 / 63.83   66.45 / 65.51 / 65.72   67.86 / 67.35 / 67.44

Table 3: Evaluation results of three models on different tasks, where P, R, F denote precision, recall and weighted f1 score.

Task                 Method       A      P      R      F
task-1a (English)    CNN+BiLSTM   70.76  71.50  70.76  70.63
task-1b (Bangla)     CNN+BiLSTM   79.97  81.50  82.41  80.25
task-1c (Gujarati)   CNN+BiLSTM   65.45  1.95   1.81   1.86
task-1d (Hindi)      BiLSTM       57.28  57.13  55.99  54.57
task-1e (Malayalam)  BiLSTM       31.37  0.32   0.18   0.19
task-1f (Marathi)    BiLSTM       63.09  65.98  61.38  59.81
task-1g (Tamil)      CNN+BiLSTM   49.23  48.38  61.34  43.70
task-1h (Telugu)     BiLSTM       52.82  0.76   0.64   0.68
task-2a (English)    CNN+BiLSTM   70.14  71.51  70.19  70.40

Table 4: Evaluation results on the test set. Here A, P, R, F denote accuracy, precision, recall and weighted f1 score respectively.

The results reveal that the BiLSTM model achieved the higher f1 scores of 79.52%, 91.86%, 86.48% and 87.41% for tasks 1d, 1e, 1f and 1h. It outperforms the CNN model for all tasks. The reason behind the superior results of LSTM is its capability to capture long-range dependencies. However, the combined CNN and BiLSTM provides interesting insights. It outdoes the previous BiLSTM model in tasks 1a, 1b, 1c, 1g and 2a by obtaining 82.63%, 81.95%, 82.39%, 84.37% and 67.44% f1 scores. The model achieved a 2% rise in f1 score for task-2a, where the fine-grained domain of a text is identified. In all the cases, there exists a small difference (< 0.5%) between the results of BiLSTM and CNN+BiLSTM. By analyzing the results, it is observed that for a task with a smaller number of classes, all models achieved quite similar performance. However, when the number of classes increased, the BiLSTM and CNN+BiLSTM models performed better than CNN. This is because the CNN model cannot capture sequential features as well as LSTM.

Table 4 shows the output of the best run on the test set for each task. Based on the performance on the development set, methods are selected to predict on the test set. Therefore, we use the CNN+BiLSTM model to predict on tasks 1a, 1b, 1c, 1g and 2a. The model achieved 70.63%, 80.25%, 1.86%, 43.70% and 70.4% weighted f1 scores on these tasks respectively. Unlike the other tasks, precision, recall, and f1 score are much lower for task 1c compared to the validation results. This lower score might be due to some mistake during evaluation. Task 2a gets a better f1 score on the test set compared to the development set. For the other cases, the performance of the methods degraded compared to the development set.

The BiLSTM method is used to get the outputs for tasks 1d, 1e, 1f and 1h. The model obtained 57.13%, 0.32%, 65.98% and 0.76% weighted f1 scores on these tasks. It is suspected that some errors might have occurred during evaluation for tasks 1e and 1h. The model achieved 91.86% and 87.41% f1 scores on these tasks for the validation set but got an implausible result on the test set. This error might occur due to Unicode issues of the different languages. Our system also encountered an error when data was read from the text file. The performance of the BiLSTM method decreased on the test set compared to the validation set for all tasks.

Precision, recall and f1 score have fallen for each task in the test set except task 2a, where the weighted f1 score increased by 2.5%. For all the tasks, we observed a substantial variation between the development set and test set results. There are two possible reasons behind this unpredictable behaviour of the models. First, the model may be overfitted on the training set; thus, it gets better results on the training and validation sets but poor results on the test set. Second, the test data may be more diverse than the training data. If significant overlap does not exist between the train and test features, the model indeed performs poorly on the test data, since the models learn from the characteristics of the training data.

6 Conclusion

This paper presents a detailed description of the proposed system and its evaluation for technical text classification in different languages. As baseline methods, we used CNN and BiLSTM, and compared these methods with the proposed model (combined CNN and BiLSTM). Each model is trained, tuned and evaluated separately for subtasks 1 and 2. The proposed method showed better performance in terms of accuracy for subtasks (a, b, c, g) of task 1 and task 2a on the development set. However, in the case of the test set, the system performed better for subtasks 1a, 1b, 1c, 1g and 2a. More data can be included for improved performance. In future, the attention mechanism may be explored to observe its effects on text classification tasks.

References

Mohamed Seghir Hadj Ameur, Riadh Belkebir, and Ahmed Guessoum. 2020. Robust Arabic text categorization by combining convolutional and recurrent neural networks. ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 19(5):1–16.

Minh Duc Cao and Xiaoying Gao. Combining contents and citations for scientific document classification. In Int. Conf. on Advances in Artificial Intelligence.

Eftekhar Hossain, Omar Sharif, Mohammed Moshiul Hoque, and Iqbal H Sarker. 2020. SentiLSTM: A deep learning approach for sentiment analysis of restaurant reviews. arXiv preprint arXiv:2011.09684.

Alon Jacovi, Oren Sar Shalom, and Yoav Goldberg. 2018. Understanding convolutional neural networks for text classification. arXiv preprint arXiv:1809.08037.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

Shervin Minaee, Nal Kalchbrenner, Erik Cambria, Narjes Nikzad, Meysam Chenaghlu, and Jianfeng Gao. 2020. Deep learning based text classification: A comprehensive review. arXiv preprint arXiv:2004.03705.

Nihar M Ranjan, YR Ghorpade, GR Kanthale, AR Ghorpade, and AS Dubey. 2017. Document classification using LSTM neural network. Journal of Data Mining and Management, 2(2):1–9.

Chunting Zhou, Chonglin Sun, Zhiyuan Liu, and Francis Lau. 2015. A C-LSTM neural network for text classification. arXiv preprint arXiv:1511.08630.

Peng Zhou, Zhenyu Qi, Suncong Zheng, Jiaming Xu, Hongyun Bao, and Bo Xu. 2016. Text classification improved by integrating bidirectional LSTM with two-dimensional max pooling. arXiv preprint arXiv:1611.06639.


An Attention Ensemble Approach for Efficient Text Classification of Indian Languages

Atharva Kulkarni1, Amey Hengle1, and Rutuja Udyawar2

1Department of Computer Engineering, PVG's COET, Savitribai Phule Pune University, India.
2Optimum Data Analytics, India.

1{atharva.j.kulkarni1998, ameyhengle22}@[email protected]

Abstract

The recent surge of complex attention-based deep learning architectures has led to extraordinary results in various downstream NLP tasks in the English language. However, such research for resource-constrained and morphologically rich Indian vernacular languages has been relatively limited. This paper proffers team SPPU AKAH's solution for the TechDOfication 2020 subtask-1f: which focuses on the coarse-grained technical domain identification of short text documents in Marathi, a Devanagari script-based Indian language. Availing the large dataset at hand, a hybrid CNN-BiLSTM attention ensemble model is proposed that competently combines the intermediate sentence representations generated by the convolutional neural network and the bidirectional long short-term memory, leading to efficient text classification. Experimental results show that the proposed model outperforms various baseline machine learning and deep learning models in the given task, giving the best validation accuracy of 89.57% and f1-score of 0.8875. Furthermore, the solution resulted in the best system submission for this subtask, giving a test accuracy of 64.26% and f1-score of 0.6157, transcending the performances of other teams as well as the baseline system given by the organizers of the shared task.

1 Introduction

The advent of attention-based neural networks and the availability of large labelled datasets has resulted in great success and state-of-the-art performance for English text classification (Yang et al., 2016; Zhou et al., 2016; Wang et al., 2016; Gao et al., 2018). Such results, however, for Indian language text classification tasks are far and few, as most of the research employs traditional machine learning and deep learning models (Joshi et al., 2019; Tummalapalli et al., 2018; Bolaj and Govilkar, 2016a,b; Dhar et al., 2018). Apart from being heavily consumed in the print format, the growth in the Indian languages internet user base is monumental, scaling from 234 million in 2016 to 536 million by 2021 1. Even so, just like most other Indian languages, the progress in NLP for Marathi has been relatively constrained, due to factors such as the unavailability of large-scale training resources, structural un-similarity with the English language, and a profusion of morphological variations, thus making the generalization of deep learning architectures to languages like Marathi difficult.

This work posits a solution for the TechDOfication 2020 subtask-1f: coarse-grained domain classification for short Marathi texts. The task provides a large corpus of Marathi text documents spanning across four domains: Biochemistry, Communication Technology, Computer Science, and Physics. Efficient domain identification can potentially impact and improve research in downstream NLP applications such as question answering, transliteration, machine translation, and text summarization, to name a few. Inspired by the works of (Er et al., 2016; Guo et al., 2018; Zheng and Zheng, 2019), a hybrid CNN-BiLSTM attention ensemble model is proposed in this work. In recent years, Convolutional Neural Networks (Kim, 2014; Conneau et al., 2016; Zhang et al., 2015; Liu et al., 2020; Le et al., 2017) and Recurrent Neural Networks (Liu et al., 2016; Sundermeyer et al., 2015) have been used quite frequently for text classification tasks. Quite different from one another, the CNNs and the RNNs show different capabilities to generate intermediate text representations. CNN models an input sentence by utilizing convolutional filters to identify the most influential n-grams of different semantic aspects (Conneau et al., 2016). RNN can handle variable-length input sentences and is particularly well suited for modeling sequential data, learning important temporal features and long-term dependencies for robust text representation (Hochreiter and Schmidhuber, 1997). However, whilst CNN can only capture local patterns and fails to incorporate the long-term dependencies and the sequential features, RNN cannot distinguish keywords that contribute more context to the classification task from normal stopwords. Thus, the proposed model hypothesizes a potent way to subsume the advantages of both the CNN and the RNN using the attention mechanism. The model employs a parallel structure where both the CNN and the BiLSTM model the input sentences independently. The intermediate representations thus generated are combined using the attention mechanism. Therefore, the generated vector has useful temporal features from the sequences generated by the RNN according to the context generated by the CNN. Results attest that the proposed model outperforms various baseline machine learning and deep learning models in the given task, giving the best validation accuracy and f1-score.

1https://home.kpmg/in/en/home/insights/2017/04/indian-language-internet-users.html

2 Related Work

Since the past decade, the research in NLP has shifted from a traditional statistical standpoint to complex neural network architectures. CNN and RNN based architectures have emerged as greatly successful for the text classification task. Yoon Kim was the first to apply a CNN model for English text classification. In this work, a series of experiments were conducted with single as well as multi-channel convolutional neural networks, built on top of randomly generated, pretrained, and fine-tuned word vectors (Kim, 2014). This success of CNN for text classification led to the emergence of more complex CNN models (Conneau et al., 2016) as well as CNN models with character-level inputs (Zhang et al., 2015). RNNs are capable of generating effective text representations by learning temporal features and long-term dependencies between the words (Hochreiter and Schmidhuber, 1997; Graves and Schmidhuber, 2005). However, these methods treat each word in the sentence equally and thus cannot distinguish between the keywords that contribute more to the classification and the common words. Hybrid models proposed by (Xiao and Cho, 2016) and (Hassan and Mahmood, 2017) succeed in exploiting the advantages of both CNN and RNN, by using them in combination for text classification.

Since the introduction of the attention mechanism (Vaswani et al., 2017), it has become an effective strategy for dynamically learning the contribution of different features to specific tasks. Needless to say, the attention mechanism has expeditiously found its way into the NLP literature, with many works effectively leveraging it to improve the text classification task. (Guo et al., 2018) proposed a CNN-RNN attention-based neural network (CRAN) for text classification. This work illustrates the effectiveness of using the CNN layer as the context of the attention model. Results show that using this mechanism enables the proposed model to pick the important words from the sequences generated by the RNN layer, thus helping it to outperform many baselines as well as hybrid attention-based models in the text classification task. (Er et al., 2016) proposed an attention pooling strategy, which focuses on making a model learn better sentence representations for improved text classification. The authors use the intermediate sentence representations produced by a BiLSTM layer in reference with the local representations produced by a CNN layer to obtain the attention weights. Experimental results demonstrate that the proposed model outperforms state-of-the-art approaches on a number of benchmark datasets for text classification. (Zheng and Zheng, 2019) combine the BiLSTM and CNN with the attention mechanism for fine-grained text classification tasks. The authors employ a method in which intermediate sentence representations generated by BiLSTM are passed to a CNN layer, which is then max-pooled to capture the local features of a sentence. The local feature representations are further combined by using an attention layer to calculate the attention weights. In this way, the attention layer can assign different weights to features according to their importance to the text classification task.

The literature in NLP focusing on the resource-constrained Indian languages has been fairly restrained. (Tummalapalli et al., 2018) evaluated the performance of vanilla CNN, LSTM, and multi-input CNN for the text classification of Hindi and Telugu texts. The results indicate that CNN based models perform surprisingly better as compared to LSTM and SVM using n-gram features. (Joshi et al., 2019) have compared different deep learning approaches for Hindi sentence classification. The authors have evaluated the effect of using pre-trained fasttext Hindi embeddings on the sentence classification task. The finest classification performance is achieved by the vanilla CNN model when initialized with fasttext word embeddings fine-tuned on the specific dataset.

Label     Training Data  Validation Data
bioche    5,002          420
com tech  17,995         1,505
cse       9,344          885
phy       9,656          970
Total     41,997         3,780

Table 1: Data distribution.

3 Dataset

The TechDOfication-2020 subtask-1f dataset consists of labelled Marathi text documents, each belonging to one of the four classes, namely: Biochemistry (bioche), Communication Technology (com tech), Computer Science (cse), and Physics (phy). The training data has a mean length of 26.89 words with a standard deviation of 25.12.

Table 1 provides an overview of the distribution of the corpus across the four labels for training and validation data. From the table, it is evident that the dataset is imbalanced, with the classes Communication Technology and Biochemistry having the most and the least documents, respectively. It is, therefore, reasonable to postulate that this data imbalance may lead to the overfitting of the model on some classes. This is further articulated in the Results section.

4 Proposed Model

This section describes the proposed multi-input attention-based parallel CNN-BiLSTM. Figure 1 depicts the model architecture. Each component is explained in detail as follows:

4.1 Word Embedding Layer

Figure 1: Model Architecture.

The proposed model uses fasttext word embeddings trained with the unsupervised skip-gram model to map the words from the corpus vocabulary to a corresponding dense vector. Fasttext embeddings are preferred over the word2vec (Mikolov et al., 2013) or glove variants (Pennington et al., 2014), as fasttext represents each word as a sequence of character n-grams, which in turn helps to capture the morphological richness of languages like Marathi. The embedding layer converts each word $w_i$ in the document $T = \{w_1, w_2, ..., w_n\}$ of $n$ words into a real-valued dense vector $e_i$ using the following matrix-vector product:

$e_i = W v_i$ (1)

where $W \in \mathbb{R}^{d \times |V|}$ is the embedding matrix, $|V|$ is a fixed-sized vocabulary of the corpus and $d$ is the word embedding dimension. $v_i$ is the one-hot encoded vector with the element corresponding to $e_i$ set to 1 while the other elements are set to 0. Thus, the document can be represented as the real-valued vector $e = \{e_1, e_2, ..., e_n\}$.

4.2 Bi-LSTM Layer

The word embeddings generated by the embedding layer are fed to the BiLSTM unit step by step. A Bidirectional Long Short-Term Memory (Bi-LSTM) (Graves and Schmidhuber, 2005) layer is just a combination of two LSTMs (Hochreiter and Schmidhuber, 1997) running in opposite directions.


This allows the network to have both forward and backward information about the sequence at every time step, resulting in better understanding and preservation of the context. It is also able to counter the problem of vanishing gradients to a certain extent by utilizing the input, the output, and the forget gates. The intermediate sentence representation generated by the Bi-LSTM is denoted as $h$.

4.3 CNN Layer

The discrete convolutions performed by the CNN layer on the input word embeddings help to extract the most influential n-grams in the sentence. Three parallel convolutional layers with three different window sizes are used so that the model can learn multiple types of embeddings of local regions, which complement one another. Finally, the sentence representations of all the different convolutions are concatenated and max-pooled to get the most dominant features. The output is denoted as $c$.

4.4 Attention Layer

The linchpin of the model is the attention block that effectively combines the intermediate sentence feature representation generated by the BiLSTM with the local feature representation generated by the CNN. At each time step $t$, taking the output $h_t$ of the BiLSTM and $c_t$ of the CNN, the attention weights $\alpha_t$ are calculated as:

$u_t = \tanh(W_1 h_t + W_2 c_t + b)$ (2)

$\alpha_t = \mathrm{Softmax}(u_t)$ (3)

where $W_1$ and $W_2$ are the attention weights, and $b$ is the attention bias learned via backpropagation. The final sentence representation $s$ is calculated as the weighted arithmetic mean based on the weights $\alpha = \{\alpha_1, \alpha_2, ..., \alpha_n\}$ and the output of the BiLSTM $h = \{h_1, h_2, ..., h_n\}$. It is given as:

$s = \sum_{t=1}^{n} \alpha_t \ast h_t$ (4)

Thus, the model is able to retain the merits of both the BiLSTM and the CNN, leading to a more robust sentence representation. This representation is then fed to a fully connected layer for dimensionality reduction.
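The following sketch shows one plausible TensorFlow implementation of Equations (2)-(4); because the paper does not fully specify how the CNN output is aligned with each time step, the sketch broadcasts the pooled CNN vector over time, and the extra projection to a scalar score (v) is an added assumption needed to obtain one weight per time step.

```python
import tensorflow as tf

class CNNContextAttention(tf.keras.layers.Layer):
    """Scores BiLSTM states h_t using the CNN representation c as context (Eqs. 2-4)."""

    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)   # W1, applied to the BiLSTM states h_t
        self.W2 = tf.keras.layers.Dense(units)   # W2, applied to the CNN representation
        self.v = tf.keras.layers.Dense(1)        # assumed projection to one score per time step

    def call(self, h, c):
        # h: (batch, timesteps, 2 * lstm_units) BiLSTM outputs
        # c: (batch, cnn_dim) pooled CNN output, broadcast over all time steps
        c = tf.expand_dims(c, axis=1)
        u = tf.nn.tanh(self.W1(h) + self.W2(c))        # Eq. (2)
        alpha = tf.nn.softmax(self.v(u), axis=1)       # Eq. (3): weights over time steps
        s = tf.reduce_sum(alpha * h, axis=1)           # Eq. (4): weighted sum of h_t
        return s
```

The returned vector corresponds to the sentence representation $s$ that is passed to the fully connected and classification layers described next.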

4.5 Classification Layer

The output of the fully connected attention layer is passed to a dense layer with softmax activation to predict a discrete label out of the four labels in the given task.

5 Experimental Setup

Each text document is tokenized and padded to a maximum length of 100. Longer documents are truncated. The work of (Kakwani et al., 2020) is referred to for selecting the optimal set of hyperparameters for training the fasttext skip-gram model. The 300-dimensional fasttext word embeddings are trained on the given corpus for 50 epochs, with a minimum token count of 1 and 10 negative examples sampled for each instance. The rest of the hyperparameter values were chosen as default (Bojanowski et al., 2017). After training, an average loss of 0.6338 was obtained over 50 epochs. The validation dataset is used to tune the hyperparameters. The LSTM layer dimension is set to 128 neurons with a dropout rate of 0.3. Thus, the BiLSTM gives an intermediate representation of 256 dimensions. For the CNN block, we employ three parallel convolutional layers with filter sizes 3, 4, and 5, each having 256 feature maps. A dropout rate of 0.3 is applied to each layer. The local representations thus generated by the parallel CNNs are then concatenated and max-pooled. All other parameters in the model are initialized randomly. The model is trained end-to-end for 15 epochs, with the Adam optimizer (Kingma and Ba, 2014), sparse categorical cross-entropy loss, a learning rate of 0.001, and a minibatch size of 128. The best model is stored and the learning rate is reduced by a factor of 0.1 if the validation loss does not decline after two successive epochs.
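A minimal sketch of this embedding training step is given below, assuming the official fasttext Python bindings and a placeholder corpus file; the authors may equally have used the command-line tool or another wrapper.

```python
import fasttext

# Train 300-dimensional skip-gram embeddings on the raw corpus
# ("marathi_corpus.txt" is a placeholder file with one document per line).
ft_model = fasttext.train_unsupervised(
    "marathi_corpus.txt",
    model="skipgram",
    dim=300,        # embedding dimension used above
    epoch=50,       # 50 training epochs
    minCount=1,     # minimum token count of 1
    neg=10,         # 10 negative examples sampled per instance
)

vector = ft_model.get_word_vector("संगणक")   # subword n-grams give vectors even for rare words
```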

6 Baseline Models

The performance of the proposed model is compared with a host of machine learning and deep learning models and the results are reported in Table 3. They are as follows:

Feature Based models: Multinomial Naive Bayes with bag-of-words input (MNB + BoW), Multinomial Naive Bayes with tf-idf input (MNB + TF-IDF), Linear SVC with bag-of-words input (LSVC + BoW), and Linear SVC with tf-idf input (LSVC + TF-IDF).

Basic Neural Networks: Feed-forward neural network with max-pooling (FFNN), CNN with max-pooling (CNN), and BiLSTM with max-pooling (BiLSTM).

Complex Neural Networks: BiLSTM + attention (Zhou et al., 2016), serial BiLSTM-CNN (Chen et al., 2017), and serial BiLSTM-CNN + attention. A minimal sketch of the LSVC + TF-IDF baseline is given after Table 2.

Metrics    bioche  com tech  cse     phy
Precision  0.9128  0.8831    0.9145  0.8931
Recall     0.7976  0.9342    0.8949  0.8793
F1-Score   0.8513  0.9079    0.9046  0.8862

Table 2: Detailed performance of the proposed model on the validation data.
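As a rough illustration of the feature-based baselines above, the following scikit-learn sketch implements the LSVC + TF-IDF variant with default vectorizer settings (the exact n-gram range and tokenization used by the authors are not reported); the texts and labels shown are placeholders.

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.metrics import f1_score

# Placeholder data; in practice these are the Marathi training and validation texts.
train_texts, train_labels = ["नमुना मजकूर एक", "दुसरा नमुना मजकूर"], ["cse", "phy"]
val_texts, val_labels = ["आणखी एक नमुना"], ["cse"]

# TF-IDF features over word tokens fed into a linear SVM (the LSVC + TF-IDF baseline).
baseline = make_pipeline(TfidfVectorizer(), LinearSVC())
baseline.fit(train_texts, train_labels)

pred = baseline.predict(val_texts)
print(f1_score(val_labels, pred, average="weighted"))
```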

7 Results and Discussion

The performance of all the models is listed in Table 3. The proposed model outperforms all other models in validation accuracy and weighted f1-score. It achieves better results than standalone CNN and BiLSTM, thus reasserting the importance of combining both structures. The BiLSTM with attention model is similar to the proposed model, but the context is ignored. As the proposed model outperforms the BiLSTM with attention model, it proves the effectiveness of the CNN layer for providing context. Stacking a convolutional layer over a BiLSTM unit results in lower performance than the standalone BiLSTM. It can thus be inferred that combining CNN and BiLSTM in a parallel way is much more effective than just serially stacking them. Thus, the proposed attention mechanism is able to successfully unify the CNN and the BiLSTM, providing meaningful context to the temporal representation generated by the BiLSTM. Table 2 reports the detailed performance of the proposed model on the validation data. The precision and recall for the communication technology (com tech), computer science (cse), and physics (phy) labels are quite consistent. The Biochemistry (bioche) label suffers from a high difference in precision and recall. This can be traced back to the fact that less training data is available for this label, leading to the model overfitting on it.

8 Conclusion and Future work

While NLP research in English is achieving new heights, the progress in low-resource languages is still in its nascent stage. The TechDOfication task paves the way for research in this field through the task of technical domain identification for texts in Indian languages. This paper proposes a CNN-BiLSTM based attention ensemble model for subtask-1f of Marathi text classification. The parallel CNN-BiLSTM attention-based model successfully unifies the intermediate representations generated by both the models using the attention mechanism. It provides a way for further research in adapting attention-based models for low-resource and morphologically rich languages. The performance of the model can be enhanced by giving additional inputs such as character n-grams and document-topic distributions. More efficient attention mechanisms can be applied to further consolidate the amalgamation of CNN and RNN.

Model                             Validation Accuracy  Validation F1-Score
MNB + BoW                         86.74                0.8352
MNB + TF-IDF                      77.16                0.8010
Linear SVC + BoW                  85.76                0.8432
Linear SVC + TF-IDF               88.17                0.8681
FFNN                              76.11                0.7454
CNN                               86.67                0.8532
BiLSTM                            89.31                0.8842
BiLSTM + Attention                88.14                0.8697
Serial BiLSTM-CNN                 88.99                0.8807
Serial BiLSTM-CNN + Attention     88.23                0.8727
Ensemble CNN-BiLSTM + Attention   89.57                0.8875

Table 3: Performance comparison of different models on the validation data.

References

Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2017. Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5:135–146.

Pooja Bolaj and Sharvari Govilkar. 2016a. A survey on text categorization techniques for Indian regional languages. International Journal of Computer Science and Information Technologies, 7(2):480–483.

Pooja Bolaj and Sharvari Govilkar. 2016b. Text classification for Marathi documents using supervised learning methods. International Journal of Computer Applications, 155(8):6–10.

Tao Chen, Ruifeng Xu, Yulan He, and Xuan Wang. 2017. Improving sentiment analysis via sentence type classification using BiLSTM-CRF and CNN. Expert Systems with Applications, 72:221–230.

Alexis Conneau, Holger Schwenk, Loïc Barrault, and Yann Lecun. 2016. Very deep convolutional networks for text classification. arXiv preprint arXiv:1606.01781.


Ankita Dhar, Himadri Mukherjee, Niladri Sekhar Dash, and Kaushik Roy. 2018. Performance of classifiers in Bangla text categorization. In 2018 International Conference on Innovations in Science, Engineering and Technology (ICISET), pages 168–173. IEEE.

Meng Joo Er, Yong Zhang, Ning Wang, and Mahardhika Pratama. 2016. Attention pooling-based convolutional neural network for sentence modelling. Information Sciences, 373:388–403.

Shang Gao, Arvind Ramanathan, and Georgia Tourassi. 2018. Hierarchical convolutional attention networks for text classification. In Proceedings of The Third Workshop on Representation Learning for NLP, pages 11–23.

Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6):602–610.

Long Guo, Dongxiang Zhang, Lei Wang, Han Wang, and Bin Cui. 2018. CRAN: a hybrid CNN-RNN attention-based model for text classification. In International Conference on Conceptual Modeling, pages 571–585. Springer.

A. Hassan and A. Mahmood. 2017. Efficient deep learning model for text classification based on recurrent and convolutional layers. In 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA), pages 1108–1113.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Ramchandra Joshi, Purvi Goel, and Raviraj Joshi. 2019. Deep learning for Hindi text classification: A comparison. In International Conference on Intelligent Human Computer Interaction, pages 94–101. Springer.

Divyanshu Kakwani, Anoop Kunchukuttan, Satish Golla, NC Gokul, Avik Bhattacharyya, Mitesh M Khapra, and Pratyush Kumar. 2020. IndicNLPSuite: Monolingual corpora, evaluation benchmarks and pre-trained multilingual language models for Indian languages. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 4948–4961.

Yoon Kim. 2014. Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Hoa T Le, Christophe Cerisara, and Alexandre Denis. 2017. Do convolutional networks need to be deep for text classification? arXiv preprint arXiv:1707.04108.

Pengfei Liu, Xipeng Qiu, and Xuanjing Huang. 2016. Recurrent neural network for text classification with multi-task learning. arXiv preprint arXiv:1605.05101.

Zhenyu Liu, Haiwei Huang, Chaohong Lu, and Shengfei Lyu. 2020. Multichannel CNN with attention for text classification. arXiv preprint arXiv:2006.16174.

Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, Doha, Qatar. Association for Computational Linguistics.

Martin Sundermeyer, Hermann Ney, and Ralf Schlüter. 2015. From feedforward to recurrent LSTM neural networks for language modeling. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23(3):517–529.

Madhuri Tummalapalli, Manoj Chinnakotla, and Radhika Mamidi. 2018. Towards better sentence classification for morphologically rich languages. In Proceedings of the International Conference on Computational Linguistics and Intelligent Text Processing.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need.

Yequan Wang, Minlie Huang, Xiaoyan Zhu, and Li Zhao. 2016. Attention-based LSTM for aspect-level sentiment classification. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 606–615.

Yijun Xiao and Kyunghyun Cho. 2016. Efficient character-level document classification by combining convolution and recurrent layers.

Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1480–1489.

Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification. In Advances in Neural Information Processing Systems, pages 649–657.

Jin Zheng and Limin Zheng. 2019. A hybrid bidirectional recurrent convolutional neural network attention-based model for text classification. IEEE Access, 7:106673–106685.


Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Bingchen Li, Hongwei Hao, and Bo Xu. 2016. Attention-based bidirectional long short-term memory networks for relation classification. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 207–212.


Author Index

Ala, Hema, 27
Anusha, M D, 1

Balouchzahi, Fazlourrahman, 1

Chhajer, Akshat, 31

Das, Ayan, 6
Dowlagar, Suman, 16

Gahoi, Akshat, 31
Ghosh, Koyel, 21
Gundapu, Sunil, 11

Hengle, Amey, 40
Hoque, Mohammed Moshiul, 35
Hossain, Eftekhar, 35

Kuila, Alapan, 6
Kulkarni, Atharva, 40

Maity, Dr. Ranjan, 21
Mamidi, Radhika, 11, 16
Mishra Sharma, Dipti, 31

Sarkar, Sudeshna, 6
Senapati, Dr. Apurbalal, 21
Sharif, Omar, 35
Sharma, Dipti, 27
Shashirekha, H L, 1

Udyawar, Rutuja, 40
