
Chen W, Zhou JH, Zhu JX et al. Semi-supervised learning based tag recommendation for Docker repositories. JOURNAL

OF COMPUTER SCIENCE AND TECHNOLOGY 34(5): 957-971 Sept. 2019. DOI 10.1007/s11390-019-1954-4

Semi-Supervised Learning Based Tag Recommendation for Docker Repositories

Wei Chen 1,2, Member, CCF, Jia-Hong Zhou 1,2, Jia-Xin Zhu 1,2, Member, CCF, Guo-Quan Wu 1,2,3, Member, CCF, and Jun Wei 1,2,3, Member, CCF

1 Institute of Software, Chinese Academy of Sciences, Beijing 100190, China
2 University of Chinese Academy of Sciences, Beijing 100049, China
3 State Key Laboratory of Computer Sciences, Institute of Software, Chinese Academy of Sciences, Beijing 100190, China

E-mail: {wchen, zhoujiahong17, zhujiaxin, gqwu, wj}@otcaix.iscas.ac.cn

Received February 28, 2019; revised July 12, 2019.

Abstract    Docker has recently become the mainstream technology for providing reusable software artifacts. Developers can easily build and deploy their applications using Docker. Currently, a large number of reusable Docker images are publicly shared in online communities, and semantic tags can be created to help developers effectively reuse the images. However, the communities do not provide tagging services, and manual tagging is exhausting and time-consuming. This paper addresses the problem through a semi-supervised learning-based approach named SemiTagRec. SemiTagRec contains four components: (1) the predictor, which calculates the probability of assigning a specific tag to a given Docker repository; (2) the extender, which introduces new tags as candidates based on tag correlation analysis; (3) the evaluator, which measures the candidate tags based on a logistic regression model; (4) the integrator, which calculates a final score by combining the results of the predictor and the evaluator, and then assigns the tags with high scores to the given Docker repositories. SemiTagRec includes the newly tagged repositories in the training data for the next round of training. In this way, SemiTagRec iteratively trains the predictor with the cumulative tagged repositories and the extended tag vocabulary to achieve high tag recommendation accuracy. The experimental results show that SemiTagRec outperforms the other approaches, and its accuracy, in terms of Recall@5 and Recall@10, is 0.688 and 0.781 respectively.

Keywords tag recommendation, Docker repository, Dockerfile, semi-supervised learning

1 Introduction

Docker[1] is a well-known open source container engine that continues to dominate the container landscape 1○. With Docker, developers can easily and quickly distribute and test the software they develop in isolated OS environments[2].

Docker packages an application with its dependencies and execution environment into a standardized and self-contained unit named a Docker image. A Docker container, being a runtime instance of a Docker image, runs the application natively on the host machine's kernel. A Dockerfile, following the notion of Infrastructure-as-Code (IaC)[3], contains all the instructions for building a Docker image.

Docker has flourishing communities that host a large number of Docker repositories. Each Docker repository (repository for short hereafter) contains reusable artifacts, i.e., Dockerfiles and Docker images. According to the statistics, Docker Hub 2○ serves more than 12 billion image pulls per week[4]. So far,

Regular Paper

Special Section on Software Systems 2019

A preliminary version of the paper was published in the Proceedings of Internetware 2018.

This work was supported by the National Key Research and Development Program of China under Grant No. 2016YFB1000803 and the National Natural Science Foundation of China under Grant Nos. 61732019 and 61572480.

1○ 2017 annual container adoption survey: Huge growth in containers, 2017. https://portworx.com/2017-container-adoption-survey/, July 2019.

2○ https://hub.docker.com/, July 2019.

©2019 Springer Science + Business Media, LLC & Science Press, China


Docker Store stores more than 2.19 million repositories, and the number is still increasing. Developers can reuse these artifacts by pulling down the off-the-shelf images or by building images using the Dockerfiles.

Given such a large number of Docker repositories, effective reuse requires a good understanding of them, and semantic tags provide such a way. Tagging is effective in bookmarking and classifying software objects[5,6]. For example, users can find the target Docker artifacts by using the corresponding tags as search keywords. Furthermore, semantic tags help users to understand Docker artifacts easily, without reading their description documents and code.

However, Docker Hub and Docker Store do not provide tagging services, and manual tagging is exhausting and time-consuming. At present, some approaches have been proposed to automatically tag conventional software objects[5]. For instance, EnTagRec[7] takes text descriptions as input and recommends tags based on a supervised learning method.

Unfortunately, the existing approaches cannot be applied to our scenario. On the one hand, there is little available training data, which makes multi-label learning based approaches unworkable. On the other hand, the tag vocabulary is limited and incapable of providing plenty of semantic topics. Unlike other software information sites (e.g., Stack Overflow and Ask Ubuntu), most Docker repositories have no tags, and the communities do not maintain semantic tag vocabularies. We observe that although Docker Hub employs so-called tags for version control, these tags are version numbers without semantic information and are different from the tags we refer to in this paper.

This paper attempts to address these problems through a semi-supervised learning (SSL) based approach for Docker repositories, SemiTagRec. SemiTagRec comprises four components: 1) the predictor, which calculates the probability of assigning a specific tag to a given Docker repository; 2) the extender, which introduces new tags as candidates based on tag correlation analysis, i.e., extending the tag vocabulary; 3) the evaluator, which measures the candidate tags based on a logistic regression (LR) model[8]; 4) the integrator, which calculates a final score by combining the results of the predictor and the evaluator, and then assigns the tags with high scores to the given Docker repositories. SemiTagRec includes the newly tagged repositories in the training data for the next round of training. In this way, SemiTagRec iteratively trains the predictor with the cumulative tagged repositories and the extended tag vocabulary to achieve a high accuracy of tag recommendation.

We evaluate SemiTagRec and compare it with some other approaches. The experimental results show that SemiTagRec outperforms the others, and its accuracy, in terms of Recall@5 and Recall@10, is 0.688 and 0.781 respectively.

In summary, the contributions of this work are as follows.

1) Approach. We propose a self-optimized approach, SemiTagRec, to tagging a large number of Docker repositories. It incrementally generates the training data and extends the tag vocabulary. The approach is capable of improving tagging accuracy by self-adapting the model iteratively.

2) Dataset. We implement a prototype and collect nearly 1 000 000 Docker repositories, and the prototype generates semantic tags for each repository. This dataset is publicly accessible online 3○.

3) Evaluation. We conduct several experiments to evaluate SemiTagRec. Firstly, we compare it with related work, such as D-Tagger[9] and EnTagRec[7]. The experimental results show that SemiTagRec outperforms them in terms of Recall@5 and Recall@10. Secondly, we analyze the effect of iterative training on the performance of SemiTagRec. Finally, we evaluate the reasonability of the generated tags.

The rest of this paper is organized as follows. Section 2 briefly introduces the background and then analyzes the problem. Section 3 elaborates the details of SemiTagRec. Section 4 presents the experimental setup, and Section 5 discusses the experimental results and makes evaluations. Section 6 presents related work, and finally Section 7 concludes the paper.

2 Background and Problem Analysis

2.1 Background

Fig.1 and Fig.2 show an exemplary Docker repository, nytimes/nginx-vod-module 4○. A Docker repository generally contains several kinds of important information, e.g., the repository name, Docker images, text descriptions (both short and full), Dockerfiles (Fig.2 shows the Dockerfile of the repository in Fig.1), version information, and the pull command for downloading the image.

3○ http://39.104.105.27:8000/, July 2019.
4○ https://hub.docker.com/r/nytimes/nginx-vod-module, July 2019.


Fig.1. Exemplary Docker repository.

Fig.2. Dockerfile of the exemplary repository.


It is worth noting that the repository already contains so-called tags, but they are version information and not the semantic tags we refer to in this paper.

A Dockerfile is a configuration file containing several kinds of instructions, such as "FROM", "RUN", "ENV" and "CMD". "FROM" specifies the base image. "RUN" executes shell scripts to build an image. "ENV" sets environment variables. "CMD" starts the services packaged in the image. Therefore, Dockerfiles contain a lot of semantic information, and we use them as an important input to our approach.

2.2 Problem Analysis

Content-based tag recommendation has been widely used in many fields, including Q&A sites, app stores, software repositories, etc. Most approaches take textual information as input (particularly online profiles, comments and readme files) and use machine learning algorithms (particularly supervised learning) to perform tag recommendation. In general, state-of-the-art approaches, such as EnTagRec[7] and TagMulRec[10], use a large amount of labeled data for training and use pre-defined tag vocabularies.

However, tagging Docker repositories is very different, making these existing approaches inapplicable. The reasons are as follows.

Firstly, there is very little training data available in the online Docker communities. We crawl 88 226 repositories from Docker Hub and find that none of them have tags. Nonetheless, we observe that among the crawled repositories, there are 5 532 whose GitHub code repositories are labeled with GitHub topics (semantically similar to tags). Intuitively, the code repository and the Docker repository represent the same software system, and thus the GitHub topics can be used as semantic tags for the Docker repository. Therefore, we take these 5 532 Docker repositories and their corresponding GitHub topics as the initial tagged Docker repository set (TDS). Despite this, the amount of labeled data is too small to adequately train the predictor.

Secondly, unlike other software communities such as Stack Overflow, Docker communities do not have a predefined tag vocabulary. We survey the 5 532 Docker repositories whose code repositories have GitHub topics and find that there are only 750 unique high-quality GitHub topics associated with them. As a result, the tag vocabulary is too small to provide sufficient semantic information.

To solve the above problems, we are motivated to propose a semi-supervised learning based approach to automatically tagging a large number of Docker repositories.

3 Methodology

As shown in Fig.3, SemiTagRec contains four components, namely, the predictor, the extender, the evaluator and the integrator. Basically, SemiTagRec works in two phases, i.e., training and prediction.

Algorithm 1 describes the training process in pseudo-code. In the beginning, SemiTagRec conducts the initialization. Specifically, it trains an LR model with the manually labeled training data (i.e., the evaluator training data, ETD) and a tag correlation model with the GitHub repository library (GRL), for the evaluator and the extender respectively. SemiTagRec then iteratively trains the predictor in multiple rounds to improve the prediction accuracy and to extend the tag vocabulary of the predictor (PredTagSpace). In each round, SemiTagRec trains an L-LDA[11] model for the predictor with the tagged Docker repository set (TDS). The predictor takes as input the untagged Docker repository set (UDS) and recommends the top n tags, PredTags = {(tag_1, proba_1), ..., (tag_n, proba_n)}, where proba_i is the probability score of tag_i from the predictor, for each repository in UDS. Next, the extender analyzes the correlations between the tags in PredTags and the topics in the GitHub topic library (GTL) and takes the closely related topics, together with PredTags, as the extended tag set (ExtenTags). Then the evaluator calculates the probability scores of taking the tags in ExtenTags as candidates and outputs the evaluation results (EvaTags). After that, for each tag in the union of PredTags and EvaTags, the integrator calculates a linear combination score and takes the tags with high scores (Score_{m,i} > 0.085 according to our observation; see (5) in Subsection 3.4) as the final result (NewTags). For an untagged repository, if its NewTags is not empty (an empty NewTags means that the repository has no reasonable tags), the repository is added into the newly tagged Docker repository set (NewTDS). Finally, SemiTagRec updates UDS and TDS by moving NewTDS from the former into the latter, for the next round of training.

In this way, the sizes of TDS and PredTagSpace increase incrementally, and the performance of the predictor improves and tends to be stable after multiple iterations. Finally, we obtain an optimized L-LDA model as the final predictor.


Fig.3. SemiTagRec's working process overview: the predictor (L-LDA model), the extender (tag correlation model built from the GRL), the evaluator (LR model trained on the ETD) and the integrator, where PredTags, ExtenTags, EvaTags and NewTags flow between the components, and after each round TDS = TDS ∪ NewTDS and UDS = UDS − NewTDS.

The prediction is similar to the training. Given an

untagged Docker repository, SemiTagRec executes the

four components in sequence, and finally, it outputs the

top q tags as the final recommendation.

Algorithm 1. Training Process

Input: UDS, TDS, GRL, ETD
Output: the optimized predictor

1:  initializeExtender(GRL);
2:  initializeEvaluator(ETD);
3:  do
4:      trainPredictor(L-LDA, TDS);
5:      initialize NewTDS as an empty list;
6:      for each repository γ_i in UDS
7:          PredTags = predictor.predict(γ_i);
8:          ExtenTags = extender.extend(PredTags);
9:          EvaTags = evaluator.evaluate(ExtenTags, γ_i);
10:         NewTags = integrator.combine(EvaTags, PredTags);
11:         if NewTags is not empty
12:             NewTDS.append(γ_i, NewTags);
13:         end if
14:     end for
15:     UDS = UDS − NewTDS;
16:     TDS = NewTDS ∪ TDS;
17: until predictor's performance converges
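To make the iterative loop concrete, the following Python sketch mirrors Algorithm 1 under stated assumptions: the component objects (`train_predictor`, `extender`, `evaluator`, `integrator`) and the convergence proxy are hypothetical wrappers for the operations described above, not names from the paper's released implementation.

```python
def train_semitagrec(uds, tds, grl, etd, train_predictor, extender, evaluator,
                     integrator, max_rounds=70):
    """Iterative training loop of Algorithm 1 (a sketch, not the authors' code)."""
    extender.initialize(grl)          # tag correlation model from GitHub repositories
    evaluator.initialize(etd)         # LR model from the manually labeled samples
    predictor = None

    for _ in range(max_rounds):
        predictor = train_predictor(tds)              # L-LDA model on the current TDS
        new_tds = []                                  # newly tagged repositories this round
        for repo in uds:
            pred_tags = predictor.predict(repo)       # top-n (tag, probability) pairs
            exten_tags = extender.extend(pred_tags)   # add correlated GitHub topics
            eva_tags = evaluator.evaluate(exten_tags, repo)
            new_tags = integrator.combine(eva_tags, pred_tags)
            if new_tags:                              # empty means no reasonable tags found
                new_tds.append((repo, new_tags))

        tagged_now = {repo for repo, _ in new_tds}
        uds = [repo for repo in uds if repo not in tagged_now]   # UDS = UDS - NewTDS
        tds = tds + new_tds                                      # TDS = TDS ∪ NewTDS
        if not new_tds:          # simple convergence proxy: nothing new was tagged
            break
    return predictor
```

The essential point is that TDS grows and UDS shrinks after every round, so each retraining of the L-LDA model sees strictly more labeled data.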

3.1 Predictor

Latent Dirichlet Allocation (LDA)[12] is a generative probabilistic model for collections of discrete data such as text corpora. It is a Bayesian inference algorithm based on unsupervised learning. LDA only takes a document and an expected number of topics as input, and it outputs a probability distribution over topics rather than what the topics exactly are. As a result, LDA is not suitable for our application scenario. In contrast, L-LDA can take the tag vocabulary as input and recommend tags for untagged documents.

The predictor calculates the probability score of assigning a tag to an untagged Docker repository. It is based on L-LDA[11], a state-of-the-art Bayesian inference algorithm based on supervised learning. The L-LDA model has been proved effective in solving multi-label learning problems[13]. Unlike LDA[12], L-LDA incorporates supervision, which constrains the topic model to use only the topics corresponding to a document's label set.

SemiTagRec models a labeled Docker repository as a document d with a tag set ts. Each document d is represented as a tuple consisting of a list of word indices w^(d) = (w_1, w_2, ..., w_{N_d}), and the tag set ts is represented by a list of binary tag presence/absence indicators Λ^(d) = (t_1, t_2, ..., t_K), where each w_i ∈ {1, 2, ..., V} (1 ≤ i ≤ N_d) and each t_j ∈ {0, 1} (1 ≤ j ≤ K). Here N_d is the document length, V is the vocabulary size, and K is the total number of unique tags in PredTagSpace.

When performing training, the predictor computes the probability distribution over all the words in the vocabulary for each tag in PredTagSpace. The predictor constructs a K × V matrix Φ as shown in (1), where ϕ_{k,v} ∈ [0, 1] (1 ≤ k ≤ K, 1 ≤ v ≤ V) and Σ_{v=1}^{V} ϕ_{k,v} = 1. ϕ_{k,v} is the probability of a specific word w_v being generated from tag t_k.

When performing prediction, the predictor computes the probability distribution over all the tags in PredTagSpace for each untagged Docker repository. The predictor constructs an M × K matrix Θ as shown in (2), where ϑ_{m,k} ∈ [0, 1] (1 ≤ m ≤ M, 1 ≤ k ≤ K) and Σ_{k=1}^{K} ϑ_{m,k} = 1. ϑ_{m,k} is the likelihood of assigning tag t_k to the Docker repository docker_m, and M is the number of untagged repositories. For a specific untagged Docker repository docker_m, the probability score vector over the tags in PredTagSpace is ϑ_{docker_m} = (ϑ_{m,1}, ϑ_{m,2}, ..., ϑ_{m,K}). Finally, all the tags are ranked according to their probability scores, and the predictor recommends the top n tags as PredTags.

$$
\Phi = \begin{pmatrix}
\phi_{1,1} & \phi_{1,2} & \cdots & \phi_{1,V} \\
\phi_{2,1} & \phi_{2,2} & \cdots & \phi_{2,V} \\
\vdots & \vdots & \ddots & \vdots \\
\phi_{K,1} & \phi_{K,2} & \cdots & \phi_{K,V}
\end{pmatrix}, \qquad (1)
$$

$$
\Theta = \begin{pmatrix}
\vartheta_{1,1} & \vartheta_{1,2} & \cdots & \vartheta_{1,K} \\
\vartheta_{2,1} & \vartheta_{2,2} & \cdots & \vartheta_{2,K} \\
\vdots & \vdots & \ddots & \vdots \\
\vartheta_{M,1} & \vartheta_{M,2} & \cdots & \vartheta_{M,K}
\end{pmatrix}. \qquad (2)
$$
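For illustration, the sketch below shows how the top-n PredTags can be read off the Θ matrix in (2); `theta` and `pred_tag_space` are hypothetical names for the output of an L-LDA implementation and the tag vocabulary, respectively, and do not come from the paper.

```python
import numpy as np

def top_n_pred_tags(theta, pred_tag_space, m, n=10):
    """Return PredTags for repository m: the n tags with the highest scores."""
    scores = theta[m]                          # probability vector over all K tags
    ranked = np.argsort(scores)[::-1][:n]      # indices of the n largest scores
    return [(pred_tag_space[k], float(scores[k])) for k in ranked]

# Example: theta = np.array([[0.5, 0.3, 0.2]])
# top_n_pred_tags(theta, ["python", "nginx", "mysql"], m=0, n=2)
# -> [("python", 0.5), ("nginx", 0.3)]
```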

3.2 Extender

Initially, only a few tags are available, because there are only a small number of Docker repositories whose code bases contain GitHub topics. This limited number of tags is not enough to tag a large number of unlabeled Docker repositories.

The extender addresses this problem. It adds to PredTags the GitHub topics closely correlated with the tags already in PredTags, according to their co-occurrences in GitHub repositories. The rationale is that most Docker repositories have source code repositories in GitHub, and thus the two repositories (Docker and source code) of a certain software system would have the same (or at least similar) semantic information. According to the statistics from D-Tagger[9], among the 118 427 crawled Docker repositories, there are 105 606 (almost 90%) that have GitHub code repositories. Therefore, it is reasonable to use the most popular GitHub topics for tagging Docker repositories. In practice, we crawl the GitHub topics and select the most popular ones as the GitHub topic library (GTL), which is a set of tags represented as GTL = {tag_1, tag_2, ..., tag_N}. The popularity of a topic is measured as its total number of occurrences in GitHub repositories.

The extender computes a tag correlation score for each pair of tags in GTL. As (3) and (4) show, the extender creates an N × N tag correlation matrix TCM. In the matrix, TCM_{i,j} is the conditional probability P(tag_j | tag_i), which denotes the probability of taking tag_j as a candidate when tag_i is selected.

$$
TCM = \begin{pmatrix}
TCM_{1,1} & TCM_{1,2} & \cdots & TCM_{1,N} \\
TCM_{2,1} & TCM_{2,2} & \cdots & TCM_{2,N} \\
\vdots & \vdots & \ddots & \vdots \\
TCM_{N,1} & TCM_{N,2} & \cdots & TCM_{N,N}
\end{pmatrix}, \qquad (3)
$$

$$
P(tag_j \mid tag_i) = TCM_{i,j} = \frac{Count(tag_i, tag_j)}{Count(tag_i)}. \qquad (4)
$$

In (4), Count(tag_i) is the number of GitHub repositories containing tag_i, and Count(tag_i, tag_j) is the number of repositories containing both tag_i and tag_j.
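For illustration only, the following sketch builds the counts behind (3) and (4) from crawled GitHub repositories; the variable names (`repo_topics`, `gtl`) are assumptions, not names used in the paper.

```python
from collections import Counter
from itertools import permutations

def build_tcm(repo_topics, gtl):
    """repo_topics: one set of topics per GitHub repository; gtl: the topic library."""
    gtl_set = set(gtl)
    single = Counter()        # Count(tag_i)
    pair = Counter()          # Count(tag_i, tag_j)
    for topics in repo_topics:
        topics = topics & gtl_set
        single.update(topics)
        pair.update(permutations(topics, 2))

    tcm = {}
    for tag_i in gtl:
        if single[tag_i] == 0:
            tcm[tag_i] = {}
            continue
        # TCM[i][j] = P(tag_j | tag_i) = Count(tag_i, tag_j) / Count(tag_i)
        tcm[tag_i] = {tag_j: pair[(tag_i, tag_j)] / single[tag_i]
                      for tag_j in gtl if tag_j != tag_i}
    return tcm
```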

Given a Docker repository docker_m that contains tag_i, the extender can predict with (4) the probability of tag_j being a tag of docker_m. Algorithm 2 describes the details of how PredTags is extended.

Algorithm 2. Extending PredTags

Input: PredTags, TCM, GTL, q (integer)
Output: ExtenTags (the candidate tags correlated with PredTags)

1:  create a candidate tag list (CTList);
2:  for every tag_i in PredTags
3:      for every tag_j in GTL with tag_j ∉ PredTags
4:          proba_j = proba_i × TCM[i, j];
5:          append (tag_j, proba_j) to CTList;
6:      end for
7:      append (tag_i, proba_i) to CTList;
8:  end for
9:  sort CTList in descending order of probabilities;
10: ExtenTags = top_q(CTList);
11: return ExtenTags
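Algorithm 2 can be expressed compactly in Python. The sketch below assumes `tcm` is the nested dictionary produced by the previous sketch and `pred_tags` is the list of (tag, probability) pairs from the predictor; it is an illustration, not the authors' code.

```python
def extend_pred_tags(pred_tags, tcm, gtl, q):
    """Extend PredTags with GTL topics correlated with the predicted tags (Algorithm 2)."""
    predicted = {tag for tag, _ in pred_tags}
    ct_list = []
    for tag_i, proba_i in pred_tags:
        for tag_j in gtl:
            if tag_j in predicted:
                continue
            # a correlated candidate inherits a discounted probability (line 4)
            ct_list.append((tag_j, proba_i * tcm.get(tag_i, {}).get(tag_j, 0.0)))
        ct_list.append((tag_i, proba_i))          # keep the original predicted tag
    ct_list.sort(key=lambda item: item[1], reverse=True)
    return ct_list[:q]                            # ExtenTags: the top-q candidates
```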


3.3 Evaluator

LR[8] is an extension of the linear regression model for classification problems. This work takes the predicted probabilities as the scores to evaluate the candidate tags. Besides, LR has been proved effective in solving probability-based classification problems[9]. Furthermore, we conduct an experiment to show that LR is more appropriate than other classifiers in our scenario (see Subsection 5.1).

The evaluator, based on the LR model[8], is responsible for computing a probability score for each candidate in the ExtenTags set. For a given tag and a given Docker repository, the evaluator calculates the probability of the tag belonging to the Docker repository.

According to previous studies[6] and our observation, we initially propose 17 features (detailed in Table 1) to be used in the LR model. Different from previous work, we include a number of features about the Dockerfiles.

Table 1. Details of All Features

Feature             Encoding   Description
word_nums           Integer    Number of words in the tag
length              Integer    Number of characters in the tag
is_username         Boolean    Whether the tag is the user name
in_project_name     Boolean    Whether the tag is in the project name
contain_num         Boolean    Whether the tag contains numeric characters
in_full_desc        Boolean    Whether the tag is in the full description
in_short_desc       Boolean    Whether the tag is in the short description
gh_weight           Integer    Weight of the tag in the GitHub tag library
in_full_title       Boolean    Whether the tag is in the title of the full description
occur_count         Integer    Times the tag occurs in all descriptions
in_df_comments      Boolean    Whether the tag is in the comments of the Dockerfile
in_df_from          Boolean    Whether the tag is in the FROM command
in_df_cmd           Boolean    Whether the tag is in the CMD command
in_df_maintainer    Boolean    Whether the tag is in the MAINTAINER command
in_df_entrypoint    Boolean    Whether the tag is in the ENTRYPOINT command
in_df_run           Boolean    Whether the tag is in the RUN command
in_df_env           Boolean    Whether the tag is in the ENV command

A Dockerfile contains a set of key instructions. According to our observations, we speculate that repositories with similar functions would contain identical (or similar) key instructions. We consider the following five key instructions; a sketch of extracting the corresponding Dockerfile features follows the list.

FROM. It declares the base image that a repository depends on. According to the layered union file system, all features (such as the OS and software) of the ancestor images are inherited.

ENV. It sets environment variables, such as the working directory, the default path and so on.

RUN. It is the most complicated instruction, mainly used for executing arbitrary shell commands; one of the most important operations is installing software.

CMD and ENTRYPOINT. These two instructions are similar, and they usually declare the command that launches certain services when a Docker container instance is created.
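The in_df_* boolean features of Table 1 can be computed by a simple line-by-line scan of the Dockerfile. The sketch below is an assumption of how such a scan might look, with an invented sample Dockerfile; it is not the authors' implementation.

```python
SAMPLE_DOCKERFILE = """\
# nginx with the vod module
FROM nginx:1.15
ENV APP_HOME=/srv/app
RUN apt-get update && apt-get install -y build-essential
CMD ["nginx", "-g", "daemon off;"]
"""

def dockerfile_features(tag, dockerfile):
    """Return booleans telling whether the tag appears in each kind of instruction."""
    keys = {"from": "in_df_from", "env": "in_df_env", "run": "in_df_run",
            "cmd": "in_df_cmd", "entrypoint": "in_df_entrypoint",
            "maintainer": "in_df_maintainer"}
    features = {name: False for name in ["in_df_comments"] + list(keys.values())}
    tag = tag.lower()
    for line in dockerfile.splitlines():
        stripped = line.strip().lower()
        if stripped.startswith("#"):                       # Dockerfile comment
            features["in_df_comments"] |= tag in stripped
            continue
        for prefix, name in keys.items():
            if stripped.startswith(prefix + " ") or stripped.startswith(prefix + "\t"):
                features[name] |= tag in stripped
    return features

# dockerfile_features("nginx", SAMPLE_DOCKERFILE) marks in_df_comments,
# in_df_from and in_df_cmd as True and leaves the other instruction features False.
```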

To check whether the 17 features exhibit multicollinearity, we conduct a pair-wise correlation analysis using the Spearman rank correlation (ρ) matrix across all features. We use a common threshold of ρ = ±0.7[14] to determine the existence of multicollinearity. No value in the matrix exceeds the threshold, i.e., there is no multicollinearity.
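This check can be reproduced roughly as follows, assuming the labeled samples are held in a pandas DataFrame with one column per feature; this is a sketch, not the analysis script used in the paper.

```python
import numpy as np
import pandas as pd

def has_multicollinearity(feature_matrix: pd.DataFrame, threshold: float = 0.7) -> bool:
    rho = feature_matrix.corr(method="spearman").abs()      # |Spearman rho| matrix
    off_diagonal = rho.mask(np.eye(len(rho), dtype=bool))   # ignore the diagonal
    return bool((off_diagonal > threshold).any().any())
```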

We use stepwise regression (in the "both" direction mode) to select appropriate features, and nine features are chosen: word_nums, length, is_username, in_full_desc, in_short_desc, gh_weight, in_df_comments, in_df_from and in_df_cmd.

With the selected features, we calculate the feature values by analyzing the descriptions (short description and full description) and the Dockerfile. For example, the candidate tag "nginx" in the exemplary repository nytimes/nginx-vod-module (shown in Fig.1 and Fig.2) is modelled as (1, 5, 0, 1, 1, 1, 0, 0, 0).

We manually label 1 000 samples (500 positive samples and 500 negative samples) as the evaluator training data to train the LR model.
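A minimal sketch of the evaluator's training step is given below, using scikit-learn's logistic regression as one possible implementation; `X` and `y` stand for the 9-dimensional feature vectors and the 0/1 labels of the 1 000 manually labeled samples, and are placeholders rather than names from the paper.

```python
from sklearn.linear_model import LogisticRegression

def train_evaluator(X, y):
    """Fit an LR model on the manually labeled (repository, candidate tag) samples."""
    model = LogisticRegression(max_iter=1000)
    model.fit(X, y)
    return model

def evaluate_tag(model, feature_vector):
    """Probability that the candidate tag belongs to the repository."""
    return float(model.predict_proba([feature_vector])[0, 1])
```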

3.4 Integrator

Due to the limitation of the LR model[8], if a tag represents a latent topic that does not occur in the description or the Dockerfile, the evaluator cannot determine whether to assign the tag to the Docker repository. In consequence, some high-scored tags recommended by the predictor would not be considered reasonable.

To handle this problem, we propose the integrator, which combines the results of the predictor and the evaluator. In particular, given a Docker repository docker_m and a tag tag_i, the integrator computes a final ranking score Score_{m,i} based on (5), where Predictor_{m,i} and Evaluator_{m,i} are the probability scores of tag_i belonging to docker_m, calculated by the predictor and the evaluator respectively, and α, β ∈ [0, 1] are the linear weights.

$$
Score_{m,i} = \alpha \times Predictor_{m,i} + \beta \times Evaluator_{m,i}. \qquad (5)
$$

Finally, SemiTagRec ranks the tags by their final scores in descending order. In the training phase, we select the tags with the highest scores and add them into NewTags; in the prediction phase, we recommend the top q tags as the final result (TAG_i^{top-q}).
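The integrator's combination step is small enough to sketch directly. The threshold 0.085 and the weights α = 0.91 and β = 0.09 come from Section 3 and Subsection 4.4 respectively; the function and argument names are illustrative, not from the paper.

```python
def integrate(pred_scores, eva_scores, alpha=0.91, beta=0.09, threshold=0.085):
    """pred_scores / eva_scores: dicts mapping tag -> probability score."""
    new_tags = []
    for tag in set(pred_scores) | set(eva_scores):     # union of PredTags and EvaTags
        score = alpha * pred_scores.get(tag, 0.0) + beta * eva_scores.get(tag, 0.0)
        if score > threshold:                          # keep only high-scored tags
            new_tags.append((tag, score))
    return sorted(new_tags, key=lambda item: item[1], reverse=True)
```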

4 Experimental Setup

4.1 Dataset

GitHub Repository Library (GRL). We crawl 1 193 875 GitHub repositories as the GRL and investigate their topics. We find that the topics are of very different quality, depending on the developers' preferences and domain knowledge. Some high-quality topics are very common and occur frequently among the repositories; in contrast, the low-quality ones only occur a few times. We compute statistics on the topic occurrences. Although there are thousands of unique topics in the GRL, only a very small proportion of them are of high quality. In detail, the statistical results are as follows.

1) We obtain 5 438 644 topics, among which there are 344 881 unique ones.

2) 98.07% of the topics occur in less than 0.008 37% (100/1 193 875) of the repositories. Fig.4 shows the cumulative distribution of topic frequency.

Fig.4. Cumulative distribution of the GitHub topic frequency. (For a very few topics, the frequency exceeds 500; the figure does not show them.)

3) We further compute the occurrence statistics of the 1 000 and the 10 000 most popular topics (i.e., those occurring most frequently). Fig.5 shows that the 1 000 topics occur in 82.92% of the repositories and the 10 000 topics occur in 96.51% of the repositories.

Fig.5. Repository coverage rate along with the size of the GTL.

As a result, we make a trade-off between improving the tagging result and reducing the time cost, and select the 1 000 most popular topics as the GTL for the extender. It is worth noting that, if necessary, the GTL can be extended to contain many more popular topics.

UDS and TDS. We crawl 88 226 Docker repositories containing rich description information from Docker Hub 5○ (accessed in March 2018). Similar to the existing work[9], we perform a three-round filtering, detailed in Table 2. We take the 5 532 repositories remaining after the 3rd round as the initial tagged Docker repository set (TDS) and take the other 82 694 as the untagged Docker repository set (UDS).

Table 2. Three-Round Filtering of the Crawled Docker Repositories

Round   Description                                               Number of Remaining Repositories
1st     Filter out the repositories without GitHub code bases     81 233
2nd     Filter out the untagged repositories                       7 276
3rd     Filter out the repositories whose tags are not in GTL      5 532

ETD. We randomly select about 100 untagged repositories and generate more than 2 000 candidate tags with the predictor and the extender. We then manually label these candidates, which results in 500 positive tags and 1 931 negative ones. We then choose 1 000 samples (500 positive samples and 500 negative samples) as the evaluator training data (ETD).

5○ https://hub.docker.com/, July 2019.

4.2 Evaluation Metrics

Similar to the existing work[5,7], we use Recall@q as the evaluation criterion, which is defined as (6):

$$
Recall@q = \frac{1}{n}\sum_{i=1}^{n}\frac{|TAG_i^{top\text{-}q} \cap TAG_i^{original}|}{|TAG_i^{original}|}. \qquad (6)
$$

Suppose that there are n untagged Docker repositories; for each repository docker_i, SemiTagRec generates a top-q tag set TAG_i^{top-q}, and TAG_i^{original} is its original tag set. In this paper, we use the GitHub topics of a specific repository as TAG_i^{original}.
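A sketch of the metric, assuming `recommended` maps each test repository to its ranked tag list and `original` maps it to the set of GitHub topics used as ground truth (both are illustrative names):

```python
def recall_at_q(recommended: dict, original: dict, q: int) -> float:
    """Average, over repositories, of |top-q ∩ original| / |original| as in (6)."""
    total = 0.0
    for repo, truth in original.items():
        top_q = set(recommended.get(repo, [])[:q])
        total += len(top_q & set(truth)) / len(truth)
    return total / len(original)
```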

4.3 Research Questions

Our experiments aim to answer the following research questions.

RQ1. Is the predictor effective for measuring candidate tags? This question includes two sub-questions. 1) Is the LR model effective for measuring candidate tags? 2) How important are the features selected for the LR model of the predictor?

To answer this question, we compare the LR model with some common models, fit the LR model with the 1 000 manually labeled samples described in Subsection 3.3, and analyze the fitting results.

RQ2. Is the iterative training effective in optimizing the predictor of SemiTagRec? This question includes two sub-questions. 1) Is the iterative training effective in extending PredTagSpace? 2) Can iterative training improve the accuracy of the predictor?

To answer this question, we carry out the experiment with the 5 532 repositories and their GitHub topics. We randomly select one tenth of the repositories as the test data and take the rest as the initial training data. We then train the predictor iteratively and evaluate the accuracy of tag recommendation in terms of Recall@5 and Recall@10.

RQ3. Is SemiTagRec effective in tagging a large number of Docker repositories?

To answer this question, we carry out the experiment with the 5 532 repositories and their GitHub topics. We compare SemiTagRec with other existing approaches in terms of Recall@5 and Recall@10.

RQ4. How good is the tag recommendation result of SemiTagRec? This question includes two sub-questions. 1) Are the recommended tags reasonable? 2) Is SemiTagRec helpful in bookmarking Docker repositories?

To answer this question, we randomly select more than 100 untagged Docker repositories and use SemiTagRec to generate tags for them. We conduct an evaluation with participants majoring in software engineering on the reasonability of the tags. In addition, we use some case studies to show how SemiTagRec semantically labels the repositories with tags.

4.4 Parameter Settings

SemiTagRec contains two parameters, α and β in (5), which should be set appropriately. We employ grid search[15] to determine the parameter values and set them to 0.91 and 0.09 respectively.
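A sketch of the grid search is given below. The `validate` callback (returning, e.g., Recall@5 on held-out tagged repositories) and the constraint β = 1 − α are assumptions made for illustration; the paper only states that grid search yields α = 0.91 and β = 0.09, values that happen to be consistent with this constraint.

```python
import numpy as np

def grid_search_weights(validate, step=0.01):
    """Search the weight grid for (5) and return the best (alpha, beta, score)."""
    best_alpha, best_beta, best_score = None, None, -1.0
    for alpha in np.arange(0.0, 1.0 + step, step):
        beta = 1.0 - alpha                 # simplifying assumption: weights sum to 1
        score = validate(alpha, beta)      # e.g., Recall@5 on a validation split
        if score > best_score:
            best_alpha, best_beta, best_score = alpha, beta, score
    return best_alpha, best_beta, best_score
```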

5 Experimental Results

All the experiments are conducted on a server with an 8-core 3.50 GHz CPU, 32 GB of memory and Ubuntu 18.04.01 LTS (Linux version 4.15.0-34-generic).

5.1 RQ1

Many models can solve classification problems, e.g., LR[8], naive Bayes (NB)[16], K-nearest neighbours (KNN)[17] and random forest (RF)[18]. To evaluate their classification performance, we conduct an experiment on the ETD. As Table 3 shows, LR is better than the other classifiers in terms of accuracy (ACC), area under the curve (AUC) and F1 score.

Table 3. Performance of Different Classifiers

Classifier   ACC      AUC      F1 Score
LR           0.910 0  0.950 4  0.900 8
NB           0.766 0  0.818 1  0.758 6
KNN          0.868 0  0.945 1  0.852 9
RF           0.890 0  0.946 7  0.890 1

The fitting results of the LR model are shown in Table 4. The first column presents the intercept and the feature variables of the model, the next two columns (Est and Std. Err.) give the estimated value and the standard error of each coefficient respectively, and the last two columns (z value and P-value) show the results of the statistical test of whether the coefficients are zero. We can see that most of the selected features are significantly (at the < 0.05 level) associated with the dependent variable (whether the tag belongs to the Docker repository), among which the feature in_df_cmd is the most important one. The odds ratio of in_df_cmd is 153.6, i.e., e^{5.0344}, where 5.0344 is the coefficient of in_df_cmd. It indicates that the tags appearing in the CMD instruction have much higher odds (153.6 times) of belonging to the Docker repository compared with the other tags.

Table 4. Fitting Results of the Logistic Regression Model

                  Est          Std. Err.    z Value      P-Value
(Intercept)       -0.389 10      0.330 45    -1.177 00   0.239 00
word_nums         -0.755 80      0.312 45    -2.419 00   1.56e-02
length             0.049 30      0.034 28     1.438 00   0.150 40
is_username      -19.129 00    724.135 00    -0.026 00   0.978 90
in_full_desc       4.209 55      0.392 52    10.724 00   <2e-16
in_short_desc      4.156 08      0.842 11     4.935 00   8.00e-07
gh_weight         -0.112 60      0.012 52    -8.987 00   <2e-16
in_df_comments     2.937 51      0.697 91     4.209 00   2.56e-05
in_df_from         2.807 53      0.676 47     4.150 00   3.32e-05
in_df_cmd          5.034 40      2.241 69     2.246 00   2.47e-02

5.2 RQ2

To evaluate the effectiveness of iterative training, this experiment only uses the predictor of SemiTagRec for tag recommendation. We use 4 979 of the 5 532 repositories as the initial training data and iteratively train the predictor in multiple rounds. We then use the remaining 553 tagged repositories as the test data to evaluate the predictor. We conduct the experiment by increasing the number of training iterations from 0 to 70, and Fig.6 shows the experimental results.

Fig.6. Experimental results of SemiTagRec and the individual predictor: Recall@5 and Recall@10 versus the number of training iterations.

Overall, the accuracy of the predictor (in terms of Recall@5 and Recall@10, represented by the red dashed line and the green dashed line) first increases and then tends to be stable when the training iterates over 50 times. In particular, the accuracy of the predictor increases from 0.514 to 0.590 (Recall@5) and from 0.589 to 0.652 (Recall@10), an increase of 14.79% and 10.70% respectively. When the training iterates from 50 to 70 times, the accuracy changes only slightly, increasing from 0.590 to 0.598 (Recall@5) and from 0.652 to 0.655 (Recall@10). We further investigate the experimental results from 40 to 70 iterations and find that the predictor trained with 53 iterations is optimal, with an accuracy of 0.600 (Recall@5) and 0.661 (Recall@10). As such, in the following experiments, we integrate the predictor trained with 53 iterations into SemiTagRec.

We also collect statistics on the size of PredTagSpace and the size of TDS during the iterative training. Table 5 lists the results; the size of PredTagSpace increases from 740 to 846, which shows that our approach is capable of extending the tag vocabulary.

Table 5. Statistics of PredTagSpace and TDS

Iteration   PredTagSpace   TDS
0           740             4 979
10          785             9 078
20          802            13 327
30          817            17 619
40          824            21 901
50          834            26 194
60          844            30 563
70          846            34 836

5.3 RQ3

Unlike the experiment in Subsection 5.2, the experiment in this subsection evaluates the overall performance of SemiTagRec (including the other three components in addition to the predictor), and the results are also shown in Fig.6. SemiTagRec's accuracy (also in terms of Recall@5 and Recall@10, represented by the orange solid line and the blue solid line) is higher than that of the individual predictor, which means that the other three components are also helpful in improving the tag recommendation performance. In particular, after 53 training iterations, there are 838 tags in PredTagSpace and 27 191 tagged repositories, and the accuracy of SemiTagRec is 0.688 (Recall@5) and 0.781 (Recall@10), which is 14.67% and 18.15% higher than the individual predictor respectively.


We then compare SemiTagRec with other approaches, including EnTagRec[7] and D-Tagger[9]. In addition, we also implement some approaches that employ subsets of SemiTagRec's components: LR is the approach that only uses the evaluator, and PEE is the approach that integrates the predictor, the extender and the evaluator. Note that we use the same training data and test data for all of the approaches.

Table 6 lists the experimental results. SemiTagRec outperforms all of the other approaches, and EnTagRec (with 53 iterations) and D-Tagger[9] take the second and the third place respectively. LR has the lowest accuracy. We compare the results of SemiTagRec and PEE and find that SemiTagRec outperforms PEE, which also indicates that the integrator component is effective in improving the overall accuracy, since it considers the latent tags generated by the predictor. EnTagRec contains two prediction components, a Bayesian inference component (BIC) and a frequentist inference component (FIC), and the BIC is also based on the L-LDA model. As such, we use two different L-LDA models as the BIC while keeping the FIC the same, and the two experimental results differ. The different results also imply that the iteratively trained L-LDA model is more effective than the one trained only with the limited tagged repositories. As for D-Tagger, it first uses an LR model to generate a large amount of training data and then trains an L-LDA model with that data. There are two shortcomings: on the one hand, D-Tagger cannot ensure that all of the generated training data is of high quality; on the other hand, D-Tagger requires a much larger amount of data as its input. Even with more training data, D-Tagger's accuracy is still lower than that of SemiTagRec.

Table 6. Experimental Results of Different Approaches

            SemiTagRec   D-Tagger   EnTagRec (0 Iter.)   EnTagRec (53 Iter.)   LR      PEE
Recall@5    0.688        0.613      0.569                0.646                 0.497   0.557
Recall@10   0.781        0.678      0.667                0.743                 0.591   0.651

5.4 RQ4

We evaluate the reasonability of the recommended tags. We invite 11 Master's students majoring in software engineering to take part in our survey. Considering that the time cost and the human effort are expensive, we randomly select only 105 tagged Docker repositories (with 645 tags in total) and send each participant 10 repositories and their tags. The participants evaluate whether the tags are reasonable according to the repository description documents and their domain knowledge. Each tag may be judged unreasonable (0), reasonable (1) or neutral (2).

Table 7 shows the survey result. Among all the 645 tags, 402 are considered reasonable, 152 unreasonable, and the rest neutral. Therefore, the overall precision is 62.33% (taking the neutral and unreasonable ones as negative) or 72.56% (ignoring the neutral ones and taking the unreasonable ones as negative).

Table 7. Evaluation Result

Score   Evaluation     Number of Tags   Percentage (%)
0       Unreasonable   152              23.56
1       Reasonable     402              62.33
2       Neutral         91              14.11

In addition, we compute the distribution of the repositories according to the precision of their tags. As Fig.7 shows, the majority of the 105 repositories are tagged with high precision. Statistically, for more than half of the repositories, the precision is higher than 80%.

Fig.7. Precision distribution of the repositories in the evaluation.

We further investigate the evaluation result and use some case studies to show how SemiTagRec labels the Docker repositories.

6○ https://hub.docker.com/r/vhtec/jupyter-docker, July 2019.


Case 1. Docker repository vhtec/jupyter-docker 6○ provides a Jupyter notebook with a tensorflow-stack, a PHP kernel, a JavaScript kernel, a C++ kernel and a Bash kernel. For this repository, SemiTagRec predicts seven tags: "jupyter-notebook", "php", "tensorflow", "javascript", "ssl", "python" and "ipython". All of these tags are considered reasonable by the participants.

Case 2. Docker repository chriamue/openpose 7○ provides an OpenPose image. OpenPose is a library for real-time multi-person keypoint detection and multi-threading written in C++ using OpenCV and Caffe. For this repository, SemiTagRec predicts nine tags, of which six are reasonable and three are unreasonable. The reasonable tags are "cuda", "opencv", "python", "tensorflow", "3d" and "deep-learning", among which "deep-learning" is a latent tag that never occurs in the description.

Case 3. Docker repository mitsutaka/mediaproxy-relay 8○ provides a mediaproxy-relay image. For this repository, SemiTagRec predicts five tags, "emoji", "postgres", "kibana", "dotnetcore" and "relay", and only "relay" is considered reasonable. We inspect the repository and find that its description is very short. As such, SemiTagRec is limited in recommending high-quality tags for it.

5.5 Threats to Validity

Several threats that may potentially affect the validity of our work are discussed as follows.

Threats to internal validity come from several aspects. The first threat concerns using GitHub topics as tags. We investigate a large number (88 226) of Docker repositories and find that the majority of them (81 233) have corresponding GitHub-based code repositories. Furthermore, we manually examine the GitHub topics and find that they can always be used to describe the corresponding Docker repositories too. In consequence, it is reasonable to use GitHub topics. The second threat comes from the quality of the GitHub topics. We mitigate this threat by only using the most popular GitHub topics as tags. We compute statistics on the crawled GitHub topics and find that only a very small proportion of the topics occur frequently. As such, we use the top 1 000 most popular topics to form the tag vocabulary, which ensures the quality of the tags. The third threat is the implementation of the different approaches compared for answering RQ3. To ensure the fairness of the comparison, if a specific model is used by different approaches, we use the same implementation of the model in all of the approaches. For example, the L-LDA model is also employed in EnTagRec, and the code that implements L-LDA in EnTagRec is exactly the same as that in SemiTagRec, which reduces the bias introduced by different implementations in the experimental results.

External validity concerns the generality of our work. We crawl a large number of Docker repositories (88 226) from Docker Hub, the most popular community specialized for Docker, which ensures that the data we use is popular and representative. Furthermore, the tags are derived from a large number of GitHub code repositories, and the tag vocabulary of the predictor can be extended during the iterative training, which can handle the diversity of Docker repositories.

Construct validity refers to the suitability of our evaluation measures. Similar to the existing related studies, we use Recall@5 and Recall@10 as our evaluation metrics. In addition, we conduct an evaluation with 11 participants, who have rich expertise in software engineering, to evaluate the quality of the recommended tags. The knowledge background of the participants also ensures the validity of the evaluation result.

6 Related Work

6.1 Docker

In the Docker field, there are studies addressing problems of security, quality, software evolution, etc. In the security aspect, Shu et al.[19] proposed a scalable Docker image vulnerability analysis framework (DIVA) that automatically discovers and analyzes both official and community images on Docker Hub. Manu et al.[20] proposed an approach that helps assess security design and architecture quality using a multilateral security framework for Docker containers. Catuogno and Galdi[21] considered two models for defining security properties for container-based virtualization systems. A study[22] addresses the problems of how outdated container packages are and how these problems relate to the presence of bugs and severity vulnerabilities, by empirically analyzing technical lag, security vulnerabilities and bugs for Docker images.

As for software evolution and update, RUDSEA[23] extracts software code changes between two versions and analyzes their impacts on the software environment. According to the analysis, it recommends Dockerfile item updates.

7○ https://hub.docker.com/r/chriamue/openpose, July 2019.
8○ https://hub.docker.com/r/mitsutaka/mediaproxy-relay, July 2019.


Zhang et al.[24] conducted an empirical study to analyze the impact of Dockerfile evolutionary trajectories on the quality and latency of Docker-based containerization. They found six Dockerfile evolution trajectories and made a number of suggestions for practitioners.

In the aspect of Docker ecosystem study, Cito et al.[25] conducted an empirical study to characterize the Docker container ecosystem. They discovered prevalent quality issues and studied the evolution of Docker images. In addition, to lay the groundwork for research on Docker, they collected structured information about the state and the evolution of Dockerfiles on GitHub and released it as a PostgreSQL database archive[26].

Besides, the NIER (New Ideas and Emerging Results) work[4] introduces the idea of mining container image repositories and showcases the opportunities that can benefit from it.

Overall, the goal of our work is quite different from the existing ones. Rather than focusing on software maintenance and evolution, SemiTagRec aims to support Docker repository reuse by providing semantic tags for developers.

6.2 Tag Recommendation

Tag recommendation, being a popular way to cate-

gorize and search online content, has been widely

studied and used in online applications. In soft-

ware engineering field, this technique has been used

to retrieve and reuse the software artifacts in soft-

ware information sites, such as online Q&A (Stack

Overflow[5,7], AskUbuntu[7,10], AskDifferent[7,10]) and

software repositories (Freecode[5,7,10], GitHub[27,28]).

Currently, content-based tag recommendation is popu-

lar. Most of the existing approaches take text in-

formation (particularly including online profiles, com-

ments and readme files) and code (particularly includ-

ing source code, bytecode and API) as input, and use

machine learning algorithms to perform tag recommen-

dations.

TAGREC[29], based on the fuzzy set theory, is pro-

posed to automatically recommend tags for work items

in IBM Jazz. TagCombine[5] and EnTagRec[7] ab-

stract tag recommendation as a multi-label learning

problem[11,30]. TagCombine is a composite method that

employs multi-label ranking, similarity-based ranking

and tag-term based ranking together to predict the like-

lihood of a tag to be assigned to a software object.

EnTagRec makes some improvements on TagCombine.

It combines Bayesian inference based method and fre-

quentist inference based method together, where the

former computes the probability of recommending a tag

to a certain software object based on L-LDA algorithm,

and the latter takes into account the number of words

that appear along with the tag in software objects in

a training set. GRETA[27] is a graph-based approach

to assigning tags for repositories on GitHub. GRETA

constructs an entity-tag graph (ETG) for GitHub us-

ing the domain knowledge from Stack Overflow, and

then it assigns tags for repositories by taking a ran-

dom walk algorithm. Sally[31] is a tagging approach for

Maven-based software repositories. It is able to produce

tags by extracting identifiers from bytecode and har-

nessing the dependency relations between repositories.

TagMulRec[10] is an approach that recommends tags

for software objects. It creates the index for software

object descriptions and recommends tags based on simi-

larity computation and multi-classification of software

objects. FastTagRec[32] is an automated scalable tag

recommendation method using neural network based

classification. It accurately infers new tags for post-

ings in Q&A sites, by learning existing postings and

their tags from existing information. Repo-Topix[28]

generates topics from natural language text, including

repository names, descriptions, and READMEs.

Most of the above studies rely on a large volume of training data and use supervised learning based methods. However, their effectiveness greatly suffers in a cold-start scenario in which the initial tags are absent[33]. Some approaches have been proposed to address this challenge. Specifically, one study[34] proposes syntactic and neighborhood-based attributes to extend and improve tag recommendation methods, investigating syntactic patterns that can be exploited to identify and recommend tags. SemiTagRec differs from that work in that it relies on a semi-supervised learning method rather than on syntactic patterns.

The most closely related work is D-Tagger[9], which recommends tags for Docker repositories based on supervised learning. D-Tagger generates a large amount of training data with an LR model that is trained on manually labeled data. With the generated training data, D-Tagger then recommends tags for untagged Docker repositories based on the L-LDA model. Compared with SemiTagRec, D-Tagger has the following limitations. 1) The quality of the generated training data is not guaranteed, which further affects the overall recommendation performance. 2) D-Tagger has many configuration parameters, and tuning them is cumbersome and time-consuming, which makes D-Tagger hard to use in practice.

We experimentally compare SemiTagRec with related work, including D-Tagger and EnTagRec. The experimental results show that SemiTagRec outperforms both of them in terms of recommendation accuracy.

7 Conclusions

This paper proposes a semi-supervised learning-based tag recommendation approach, SemiTagRec, for Docker repositories. SemiTagRec incrementally generates the training data and extends the tag vocabulary, and it improves tagging accuracy by self-adapting the model iteratively. We compared it with related work such as D-Tagger and EnTagRec, and the experimental results showed that SemiTagRec outperforms the other approaches: its accuracy, in terms of Recall@5 and Recall@10, is 0.688 and 0.781 respectively.
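
As a rough illustration of the iterative scheme summarized above, the following Python sketch shows a generic self-training loop: a multi-label predictor is trained on the labeled repositories, confidently predicted tags on unlabeled repositories are promoted into the training set, and the predictor is retrained on the cumulative data. The classifier, the confidence threshold of 0.6, and the toy data are assumptions made for illustration; they are not the exact configuration of SemiTagRec.

```python
# A minimal self-training sketch: promote confidently predicted tags on
# unlabeled repositories into the training set and retrain.  The classifier,
# threshold and toy data are illustrative assumptions only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

labeled_docs = ["nginx web server reverse proxy", "mysql relational database storage"]
labeled_tags = [["web", "proxy"], ["database"]]
unlabeled_docs = ["postgres relational database engine", "apache httpd web server"]

for _ in range(3):  # a few self-training rounds
    vec = TfidfVectorizer()
    mlb = MultiLabelBinarizer()
    X = vec.fit_transform(labeled_docs)
    Y = mlb.fit_transform(labeled_tags)
    clf = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

    still_unlabeled = []
    for doc in unlabeled_docs:
        probs = clf.predict_proba(vec.transform([doc]))[0]
        confident = [t for t, p in zip(mlb.classes_, probs) if p >= 0.6]
        if confident:                      # cumulate newly tagged repositories
            labeled_docs.append(doc)
            labeled_tags.append(confident)
        else:
            still_unlabeled.append(doc)
    unlabeled_docs = still_unlabeled

print(len(labeled_docs), "labeled repositories after self-training")
```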

In future work, we plan to explore ways of improving the search and reuse of Docker repositories with semantic tags. Furthermore, to help developers create Docker artifacts effectively and correctly, we plan to empirically investigate the quality issues of Docker artifacts and explore ways to address them.

Acknowledgements The authors would like to thank the anonymous reviewers for their constructive comments. The authors also thank the other participants of this work for their contributions.

References

[1] Merkel D. Docker: Lightweight Linux containers for consistent development and deployment. Linux Journal, 2014, 2014(239): Article No. 2.
[2] Seo K T, Hwang H S, Moon I Y, Kwon O Y, Kim B J. Performance comparison analysis of Linux container and virtual machine for building cloud. Advanced Science and Technology Letters, 2014, 66(2): 105-111.
[3] Hummer W, Rosenberg F, Oliveira F, Eilam T. Testing idempotence for infrastructure as code. In Proc. the 14th ACM/IFIP/USENIX International Middleware Conference, December 2013, pp.368-388.
[4] Xu T Y, Marinov D. Mining container image repositories for software configuration and beyond. In Proc. the 40th International Conference on Software Engineering: New Ideas and Emerging Results, May 2018, pp.49-52.
[5] Xia X, Lo D, Wang X Y, Zhou B. Tag recommendation in software information sites. In Proc. the 10th IEEE Working Conference on Mining Software Repositories, May 2013, pp.287-296.
[6] Chen W, Xu P X, Dou W S, Wu G Q, Gao C S, Wei J. A hierarchical categorization approach for configuration management modules. In Proc. the 41st IEEE Annual Computer Software and Applications Conference, July 2017, pp.160-169.
[7] Wang S, Lo D, Vasilescu B, Serebrenik A. EnTagRec: An enhanced tag recommendation system for software information sites. In Proc. the 30th IEEE International Conference on Software Maintenance and Evolution, September 2014, pp.291-300.

[8] Hosmer D, Lemeshow S, Sturdivant R. Applied Logistic Regression (3rd edition). John Wiley & Sons, 2013.

[9] Yin K, Zhou J H, Chen W, Wu G Q, Zhu J X, Wei J. D-Tagger: A tag recommendation approach for Docker repositories. In Proc. the 10th Asia-Pacific Symposium on Internetware, September 2018, Article No. 3.
[10] Zhou P, Liu J, Yang Z J, Zhou G. Scalable tag recommendation for software information sites. In Proc. the 24th International Conference on Software Analysis, Evolution and Reengineering, February 2017, pp.272-282.

[11] Ramage D, Hall D, Nallapati R, Manning C. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proc. the 2009 Conference on Empirical Methods in Natural Language Processing, August 2009, pp.248-256.
[12] Blei D M, Ng A Y, Jordan M I. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003, 3: 993-1022.

[13] Zhang M, Zhou Z. A review on multi-label learning algorithms. IEEE Trans. Knowledge and Data Engineering, 2014, 26(8): 1819-1837.
[14] Gousios G, Pinzger M, van Deursen A. An exploratory study of the pull-based software development model. In Proc. the 36th International Conference on Software Engineering, May 2014, pp.345-355.
[15] Bergstra J, Bengio Y. Random search for hyper-parameter optimization. Journal of Machine Learning Research, 2012, 13: 281-305.
[16] McCallum A, Nigam K. A comparison of event models for naive Bayes text classification. In Proc. the 1998 AAAI/ICML Workshop on Learning for Text Categorization, July 1998, pp.41-48.
[17] Denoeux T. A k-nearest neighbor classification rule based on Dempster-Shafer theory. IEEE Transactions on Systems, Man, and Cybernetics, 1995, 25(5): 804-813.
[18] Breiman L. Random forests. Machine Learning, 2001, 45(1): 5-32.
[19] Shu R, Gu X, Enck W. A study of security vulnerabilities on Docker Hub. In Proc. the 7th ACM Conference on Data and Application Security and Privacy, March 2017, pp.269-280.
[20] Manu A, Patel J, Akhtar S, Agrawal V, Murthy K. Docker container security via heuristics-based multilateral security-conceptual and pragmatic study. In Proc. the 2016 International Conference on Circuit, Power and Computing Technologies, March 2016, Article No. 114.


[21] Catuogno L, Galdi C. On the evaluation of security properties of containerized systems. In Proc. the 15th International Conference on Ubiquitous Computing and Communications and the 2016 International Symposium on Cyberspace and Security, December 2016, pp.69-76.
[22] Zerouali A, Mens T, Robles G, Gonzalez-Barahona J M. On the relation between outdated Docker containers, severity vulnerabilities and bugs. In Proc. the 26th IEEE International Conference on Software Analysis, Evolution and Reengineering, February 2019, pp.491-501.
[23] Hassan F, Rodriguez R, Wang X. RUDSEA: Recommending updates of Dockerfiles via software environment analysis. In Proc. the 33rd ACM/IEEE International Conference on Automated Software Engineering, September 2018, pp.796-801.
[24] Zhang Y, Yin G, Wang T et al. An insight into the impact of Dockerfile evolutionary trajectories on quality and latency. In Proc. the 42nd IEEE Annual Computer Software and Applications Conference, July 2018, pp.138-143.
[25] Cito J, Schermann G, Wittern J, Leitner P, Zumberi S, Gall H. An empirical analysis of the Docker container ecosystem on GitHub. In Proc. the 14th International Conference on Mining Software Repositories, May 2017, pp.323-333.
[26] Schermann G, Zumberi S, Cito J. Structured information on state and evolution of Dockerfiles on GitHub. In Proc. the 15th International Conference on Mining Software Repositories, May 2018, pp.26-29.
[27] Cai X, Zhu J, Shen B et al. GRETA: Graph-based tag assignment for GitHub repositories. In Proc. the 40th IEEE Annual Computer Software and Applications Conference, June 2016, pp.63-72.
[28] Ganesan K. Topic suggestions for millions of repositories. https://github.blog/2017-07-31-topics/, July 2019.
[29] Al-Kofahi J M, Tamrawi A, Nguyen T T, Nguyen H A, Nguyen T N. Fuzzy set approach for automatic tagging in evolving software. In Proc. the 26th IEEE International Conference on Software Maintenance, September 2010, Article No. 37.
[30] Gibaja E, Ventura S. A tutorial on multilabel learning. ACM Computing Surveys, 2015, 47(3): Article No. 52.

[31] Vargas-Baldrich S, Linares-Vásquez M, Poshyvanyk D. Automated tagging of software projects using bytecode and dependencies (N). In Proc. the 30th IEEE/ACM International Conference on Automated Software Engineering, November 2015, pp.289-294.

[32] Liu J, Zhou P, Yang Z, Liu X, Grundy J. FastTagRec: Fast tag recommendation for software information sites. Automated Software Engineering, 2018, 25(4): 675-701.
[33] Belem F, Almeida J, Goncalves M. A survey on tag recommendation methods. Journal of the Association for Information Science and Technology, 2017, 68(4): 830-844.
[34] Belem F, Heringer A G, Almeida J, Goncalves M. Exploiting syntactic and neighbourhood attributes to address cold start in tag recommendation. Information Processing and Management, 2019, 56(3): 771-790.

Wei Chen received his Ph.D. degree in computer software and theory from Institute of Software, Chinese Academy of Sciences, Beijing, in 2013. He is currently an associate professor in Institute of Software, Chinese Academy of Sciences, Beijing. He is a member of CCF. His research interests include service-oriented computing, cloud computing and DevOps.

Jia-Hong Zhou received his Bachelor's degree in software engineering from Nankai University, Tianjin, in 2017. He is currently a Master student at Institute of Software, Chinese Academy of Sciences, Beijing. His research interests include machine learning and knowledge graph.

Jia-Xin Zhu received his Ph.D. degree in computer software and theory from Peking University, Beijing, in 2017. He is an assistant research professor in Institute of Software, Chinese Academy of Sciences, Beijing. He is a member of CCF. He is interested in improving software and its development through advanced software measurement from both social and technical perspectives.

Guo-Quan Wu received his Ph.D. degree in computer software and theory from University of Science and Technology of China, Hefei, in 2009. He was a visiting scholar in the School of Computer Science, Georgia Institute of Technology, in 2013–2014. He is an associate professor in Institute of Software, Chinese Academy of Sciences, Beijing. He is a member of CCF. His research interests include service-oriented computing, web-based software, software testing and dynamic analysis.

Jun Wei received his Ph.D. degree in computer science from Wuhan University, Wuhan, in 1997. He was a visiting researcher in Hong Kong University of Science and Technology, Hong Kong, in 2000. He is a professor in Institute of Software, Chinese Academy of Sciences, Beijing. He is a member of CCF. His area of research is software engineering and distributed computing, with an emphasis on middleware-based distributed software engineering.