me/mtech projects

12
SOS: A Distributed Mobile Q&A System Based on Social Networks Haiying Shen, Senior Member, IEEE, Ze Li, Student Member, IEEE, Guoxin Liu, Student Member, IEEE, and Jin Li, Fellow, IEEE Abstract—Recently, emerging research efforts have been focused on question and answer (Q&A) systems based on social networks. The social-based Q&A systems can answer non-factual questions, which cannot be easily resolved by web search engines. These systems either rely on a centralized server for identifying friends based on social information or broadcast a user’s questions to all of its friends. Mobile Q&A systems, where mobile nodes access the Q&A systems through Internet, are very promising considering the rapid increase of mobile users and the convenience of practical use. However, such systems cannot directly use the previous centralized methods or broadcasting methods, which generate high cost of mobile Internet access, node overload, and high server bandwidth cost with the tremendous number of mobile users. We propose a distributed Social-based mObile Q&A System (SOS) with low overhead and system cost as well as quick response to question askers. SOS enables mobile users to forward questions to potential answerers in their friend lists in a decentralized manner for a number of hops before resorting to the server. It leverages lightweight knowledge engineering techniques to accurately identify friends who are able to and willing to answer questions, thus reducing the search and computation costs of mobile nodes. The trace-driven simulation results show that SOS can achieve a high query precision and recall rate, a short response latency and low overhead. We have also deployed a pilot version of SOS for use in a small group in Clemson University. The feedback from the users shows that SOS can provide high-quality answers. Index Terms—Question and answer systems, online social networks, peer to peer networks Ç 1 INTRODUCTION T RADITIONAL search engines such as Google [1] and Bing [2] are the primary way for information retrieval on the Internet. To improve the performance of search engines, social search engines [3], [4], [5], [6], [7], [8], [9], [10] have been proposed to determine the results searched by key- words that are more relevant to the searchers. These social search engines group people with similar interests and refer to the historical selected results of a person’s group mem- bers to decide the relevant results for the person. Although the search engines perform well in answer- ing factual queries for information already in a database, they are not suitable for non-factual queries that are more subjective, relative and multi-dimensional (e.g., can any- one recommend a professor in advising research on social-based question and answer (Q&A) systems?), espe- cially when the information is not in the database (e.g., suggestions, recommendations, advices). One method to solve this problem is to forward the non-factual queries to humans, which are the most “intelligent machines” that are capable of parsing, interpreting and answering the queries, provided they are familiar with the queries. Accordingly, a number of expertise location systems [11], [12], [13], [14] have been proposed to search experts in social networks or Internet aided by a centralized search engine. Also, web Q&A sites such as Yahoo!Answers [15] and Ask.com [16] provide high-quality answers [17] and have been increasingly popular. To enhance the asker satisfaction on the Q&A sites, recently, emerging research efforts have been focused on social network based Q&A systems [17], [18], [19], [20], [21], [22], in which users post and answer questions through social network maintained in a centralized server. As the answerers in the social network know the backgrounds and preference of the askers, they are willing and able to pro- vide more tailored and personalized answers to the askers. The social-based Q&A systems can be classified into two categories: broadcasting-based [17], [18], [19] and central- ized [20], [21], [22]. The broadcasting-based systems broad- cast the questions of a user to all of the user’s friends. In the centralized systems [20], [21], [22], since the centralized server constructs and maintains the social network of each user, it searches the potential answerers for a given question from the asker’s friends, friends of friends and so on. In respect to the client side, the rapid prevalence of smartphones has boosted mobile Internet access, which makes the mobile Q&A system a very promising applica- tion. The number of mobile users who access Twitter [23] increased 182 percent from 14.28 million in January 2010 to 26 million in January 2011. It was estimated that Internet browser-equipped phones will surpass 1.82 billion units by 2013, eclipsing the total of 1.78 billion PCs by then [24]. The mobile Q&A systems enable users to ask and answer questions anytime and anywhere at their fingertips. How- ever, the previous broadcasting and centralized methods are not suitable to the mobile environment, where each mobile node has limited resources. Broadcasting questions to all friends of a user generates a high computing H. Shen, Z. Li, and G. Liu are with the Department of Electrical and Com- puter Engineering, 313-B Riggs Hall, Clemson University, Clemson, 29634 SC. E-mail: {shenh, zel, guoxinl}@clemson.edu. J. Li is with the Microsoft Research, One Microsoft Way, Redmond, 98052 WA. E-mail: [email protected]. Manuscript received 10 Oct. 2012; revised 9 Mar. 2013; accepted 30 Apr. 2013; date of publication 12 May 2013; date of current version 21 Feb. 2014. Recommended for acceptance by D. Xuan. For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference the Digital Object Identifier below. Digital Object Identifier no. 10.1109/TPDS.2013.130 1066 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 25, NO. 4, APRIL 2014 1045-9219 ß 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Upload: ranjith-kumar

Post on 21-Jun-2015

185 views

Category:

Engineering


1 download

DESCRIPTION

Various supports: Base paper, Abstract, All review presentation (Including UML diagrams) Lab Practice (min 10 to 15 classes) final document (including-- video output, screen shorts) With Regards...* *Ranjith kumaR. K* TRIPLE N INFOTECH 73/5,3rd FLOOR,SRI KAMATCHI COMPLEX OPP.CITY HOSPITAL (NEAR LAKSHMI COMPLEX) SALAI ROAD,Trichy - 620 018, Ph: 0431 4050403, 7200021403, 7200021404.

TRANSCRIPT

Page 1: ME/Mtech projects

SOS: A Distributed Mobile Q&A SystemBased on Social Networks

Haiying Shen, Senior Member, IEEE, Ze Li, Student Member, IEEE,

Guoxin Liu, Student Member, IEEE, and Jin Li, Fellow, IEEE

Abstract—Recently, emerging research efforts have been focused on question and answer (Q&A) systems based on social networks.

The social-based Q&A systems can answer non-factual questions, which cannot be easily resolved by web search engines. These

systems either rely on a centralized server for identifying friends based on social information or broadcast a user’s questions to all of its

friends. Mobile Q&A systems, where mobile nodes access the Q&A systems through Internet, are very promising considering the rapid

increase of mobile users and the convenience of practical use. However, such systems cannot directly use the previous centralized

methods or broadcasting methods, which generate high cost of mobile Internet access, node overload, and high server bandwidth cost

with the tremendous number of mobile users. We propose a distributed Social-based mObile Q&A System (SOS) with low overhead

and system cost as well as quick response to question askers. SOS enables mobile users to forward questions to potential answerers

in their friend lists in a decentralized manner for a number of hops before resorting to the server. It leverages lightweight knowledge

engineering techniques to accurately identify friends who are able to and willing to answer questions, thus reducing the search and

computation costs of mobile nodes. The trace-driven simulation results show that SOS can achieve a high query precision and recall

rate, a short response latency and low overhead. We have also deployed a pilot version of SOS for use in a small group in Clemson

University. The feedback from the users shows that SOS can provide high-quality answers.

Index Terms—Question and answer systems, online social networks, peer to peer networks

Ç

1 INTRODUCTION

TRADITIONAL search engines such as Google [1] and Bing[2] are the primary way for information retrieval on the

Internet. To improve the performance of search engines,social search engines [3], [4], [5], [6], [7], [8], [9], [10] havebeen proposed to determine the results searched by key-words that are more relevant to the searchers. These socialsearch engines group people with similar interests and referto the historical selected results of a person’s group mem-bers to decide the relevant results for the person.

Although the search engines perform well in answer-ing factual queries for information already in a database,they are not suitable for non-factual queries that are moresubjective, relative and multi-dimensional (e.g., can any-one recommend a professor in advising research onsocial-based question and answer (Q&A) systems?), espe-cially when the information is not in the database (e.g.,suggestions, recommendations, advices). One method tosolve this problem is to forward the non-factual queriesto humans, which are the most “intelligent machines”that are capable of parsing, interpreting and answeringthe queries, provided they are familiar with the queries.Accordingly, a number of expertise location systems [11],[12], [13], [14] have been proposed to search experts in

social networks or Internet aided by a centralized searchengine. Also, web Q&A sites such as Yahoo!Answers [15]and Ask.com [16] provide high-quality answers [17] andhave been increasingly popular.

To enhance the asker satisfaction on the Q&A sites,recently, emerging research efforts have been focused onsocial network based Q&A systems [17], [18], [19], [20], [21],[22], in which users post and answer questions throughsocial network maintained in a centralized server. As theanswerers in the social network know the backgrounds andpreference of the askers, they are willing and able to pro-vide more tailored and personalized answers to the askers.The social-based Q&A systems can be classified into twocategories: broadcasting-based [17], [18], [19] and central-ized [20], [21], [22]. The broadcasting-based systems broad-cast the questions of a user to all of the user’s friends. In thecentralized systems [20], [21], [22], since the centralizedserver constructs and maintains the social network of eachuser, it searches the potential answerers for a given questionfrom the asker’s friends, friends of friends and so on.

In respect to the client side, the rapid prevalence ofsmartphones has boosted mobile Internet access, whichmakes the mobile Q&A system a very promising applica-tion. The number of mobile users who access Twitter [23]increased 182 percent from 14.28 million in January 2010 to26 million in January 2011. It was estimated that Internetbrowser-equipped phones will surpass 1.82 billion unitsby 2013, eclipsing the total of 1.78 billion PCs by then [24].The mobile Q&A systems enable users to ask and answerquestions anytime and anywhere at their fingertips. How-ever, the previous broadcasting and centralized methodsare not suitable to the mobile environment, where eachmobile node has limited resources. Broadcasting questionsto all friends of a user generates a high computing

� H. Shen, Z. Li, and G. Liu are with the Department of Electrical and Com-puter Engineering, 313-B Riggs Hall, Clemson University, Clemson,29634 SC. E-mail: {shenh, zel, guoxinl}@clemson.edu.

� J. Li is with the Microsoft Research, One Microsoft Way, Redmond, 98052WA. E-mail: [email protected].

Manuscript received 10 Oct. 2012; revised 9 Mar. 2013; accepted 30 Apr.2013; date of publication 12 May 2013; date of current version 21 Feb. 2014.Recommended for acceptance by D. Xuan.For information on obtaining reprints of this article, please send e-mail to:[email protected], and reference the Digital Object Identifier below.Digital Object Identifier no. 10.1109/TPDS.2013.130

1066 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 25, NO. 4, APRIL 2014

1045-9219 � 2014 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

Page 2: ME/Mtech projects

overhead to the friends. Also, it results in many costlyinterruptions to users by sending questions that they can-not answer and increase their workload of looking forquestions that they can answer through a pool of receivedquestions. Further, broadcasting to a large number offriends cannot guarantee the quality of the answers. Thecentralized methods, by serving a social network consist-ing of hundreds of millions of mobile users (which are alsorapidly increasing), suffer from high cost of mobile Inter-net access for clients, high query congestion, and highserver bandwidth and maintenance costs. It was reportedthat Facebook spent more than $15 million per year forserver bandwidth costs and data center rental in additionto $100 million for purchasing 50,000 servers to relieve thehigh burden of traffic [25].

To tackle the problems in the previous social-based Q&Asystems and to realize a mobile Q&A system, a key hurdleto overcome is: How can a node identify friends most likely toanswer questions in a distributed fashion? To this problem, inthis paper, we propose a distributed Social-based mObileQ&A System (SOS) with low node overhead and systemcost as well as quick response to question askers. SOS isnovel in that it achieves lightweight distributed answerersearch, while still enabling a node to accurately identify itsfriends that can answer a question. We have also deployeda pilot version of SOS for use in a small group in ClemsonUniversity. The analytical results of the data from the realapplication show the highly satisfying Q&A service andhigh performance of SOS.

SOS leverages the lightweight knowledge engineeringtechniques to transform users’ social information andcloseness, as well as questions to IDs, respectively, so thata node can locally and accurately identify its friends capa-ble of answering a given question by mapping the ques-tion’s ID with the social IDs. The node then forwards thequestion to the identified friends in a decentralized man-ner. After receiving a question, the users answer the ques-tions if they can or forward the question to their friends.The question is forwarded along friend social links for anumber of hops, and then to the server. The cornerstoneof SOS is that a person usually issues a question that isclosely related to his/her social life. As people sharingsimilar interests are likely to be clustered in the social net-work [26], [27], the social network can be regarded associal interest clusters intersecting with each other. Bylocally choosing the most potential answerers in a node’sfriend list, the queries can be finally forwarded to thesocial clusters that have answers for the question. As theanswerers are socially close to the askers, they are morewilling to answer the questions compared to strangers inthe Q&A websites. In addition, their answers are alsomore personalized and trustable [28].

In a nutshell, SOS is featured by three advantages:(1) Decentralized. Rather than relying on a centralized

server, each node identifies the potential answerers from itsfriends, thus avoiding the query congestion and high serverbandwidth and maintenance cost problem.

(2) Low cost. Rather than broadcasting a question to all ofits friends, an asker identifies the potential answerers whoare very likely to answer this question, thus reducing thenode overhead, traffic and mobile Internet access.

(3) Quick response. An asker identifies potential answerersfrom his/her friends based on their past answer quality andanswering activeness to his/her questions.

The contributions of this work are summarized as follows:

1. As far as we know, it is the first work to design a dis-tributed Q&A mobile system based on social net-works, which can be extended to low-end mobiledevices. The system can tackle the formidable chal-lenge facing distributed systems: precise answereridentification.

2. We propose a method that leverages lightweightknowledge engineering techniques for accurateanswerer identification.

3. We use answer quality to represent both the willing-ness of a node to answer another node’s questionsand the quality of its answers. We propose a methodthat considers both interest similarity and answerquality based on past experience in question for-warder selection in order to increase the likelihoodof the receiver to answer/forward the question.

4. We have studied our crawled data fromYahoo!Answer and Twitter with regards to nodeinteractions in online Q&A systems and onlinesocial networks. We then conducted extensivetrace-driven simulations based on the crawleddata. Experimental results show the high answereridentification accuracy, low cost and shortresponse delay of SOS.

5. We have deployed a pilot version of SOS for use in asmall group in Clemson University and revealedinteresting findings in the mobile social-based Q&Asystem. Though Google earns a little higher user sat-isfaction degree than SOS on factual questions, SOSgains much higher satisfaction degree for non-factualquestions than Google. Also, socially close users tendto respond questions quickly (in supplemental mate-rial, which can be found on the Computer SocietyDigital Library at http://doi.ieeecomputersociety.org/10.1109/ TPDS.2013.130).

Note that SOS still has a centralized server to supportQ&A activities for questions that are difficult to findanswerers in the user social network. SOS also can collectprevious questions and answers in the centralized server toimprove the Q&A system performance. This journal versionpresents more comprehensive experimental results com-pared to its conference version [29]. The remainder of thepaper is organized as follows. Section 2 and Section 3 pres-ent the trace data and the design of SOS. Section 4 presentsthe trace-driven simulation results. We conclude this paperwith remarks on future work in Section 5. The supplementalmaterial, available online, presents an overview of relatedwork and additional experimental results for the effective-ness of SOS’s feedback mechanism, and for our imple-mented SOS prototype.

2 DATA STUDY FOR POTENTIAL ADVANTAGES

OF SOCIAL-BASED Q&A

In order to study the features of people interactions inonline Q&A sites and social networks, we crawled 9;419

SHEN ET AL.: SOS: A DISTRIBUTED MOBILE Q&A SYSTEM BASED ON SOCIAL NETWORKS 1067

Page 3: ME/Mtech projects

questions posted in the “Entertainment & Music–Movies”section in Yahoo!Answer and 7;559 tweets from Twitter.Since Twitter is not designed to ask questions, we choseuser ReadWriteWeb [30] that has a large number of fol-lowers in order to find more tweets and user tweeting/replyinteractions. Specifically, we crawled 2;559 tweets postedby user ReadWriteWeb and all his/her followers andfollowers’ followers that do not have @username and 5;000tweets that have @username between October 5, 2010 andOctober 26, 2010. In Twitter, a user A can specifically notifyanother user B about a tweet by adding B’s @username inthe tweet. User B will then receive a notification messagefrom the system about the tweet. In the figures below, weuse Twitter-AT to denote the tweets with @username anduse Twitter to denote tweets without @username.

First, we analyze the satisfaction of Yahoo!Answersusers based on their feedbacks. In Yahoo!Answers, anasker is allowed to rate an answer with rating stars 1-5.Fig. 1 shows the histogram of all the askers’ satisfactionwith all answers in the data set. We can see that 32 per-cent of the answerers receive five stars and 8 percent ofthe answerers receive four stars. These answers take up80 percent of the answers that receive ratings. The resultconforms to the observations in the previous research [17]that the answers provided in Yahoo!Answers are quitesatisfying. However, nearly 50 percent of answers did notreceive ratings. We suspect that one reason is because theusers in Yahoo!Answers are not closely tied, and theymay visit Yahoo!Answers only when they need to askquestions. Also, users may not take the questions oranswers very seriously. Some askers may even forget tocheck the answers. If a question is not answered when itis in the first few pages, it may never be answered later.

Fig. 2 further shows the cumulative distribution function(CDF) of the number of replies for the replied questions inYahoo!Answers and all crawled tweets in Twitter. InYahoo!Answers, nearly 73 percent of all the questionsreceive more than one response, in contrast to Twitter,

where more than 95 percent tweets receive more than oneresponse. Also, less than 20 percent of all the questionsreceive more than five answers in Yahoo!Answers, in con-trast to Twitter, where more than 80 percent of the tweetsreceive more than five responses. We suspect that one rea-son is that the users in Yahoo!Answer do not have socialrelationship, thus they may not take others’ questions asseriously as in social networks. Also, too many questionsare posted on the forum every day, so it is not easy for a per-son who may be able to answer the question to find thequestions in the first place. In Twitter, the tweets of usersare only pushed to their friends that are connected by theirinterests and social relationships. Therefore, it is more likelyfor them to interact with each other. Common interest, closesocial relationship and frequent contact motivate a user’sfriends to answer his/her questions. We can also see fromthe figure that about 30 percent of the tweets with @ receiveresponses less than five, while 20 percent of the tweets with-out @ receive responses less than five. This is because thetweets that target to a specific user will not attract discus-sions from other users, leading to a decreased number ofresponses. However, as the figure shows the users with @in Twitter are more likely to respond to each other than theYahoo!Answer. In Yahoo!Answer, nearly 73 percent of allthe questions receive more than one response while in Twit-ter with @, more than 90 percent of the users receive morethan one response. This is because if a node A sends a tweetspecifically to another node B, node B is very likely to replythe tweet to node A. In Yahoo!Answer, users do not knoweach other and may not have the same incentives to replyquestions as in Twitter.

Fig. 3 shows the CDF of the average response time for thequestions that are rated in Yahoo!Answers and tweets inTwitter. We can see that less than 30 percent of the questionsrated in Yahoo!Answers receive answers in less than 15minutes. In contrast, in Twitter, more than 50 percent of thequestions are responded within 15 minutes. This resultshows the advantages of shorter response time in social-based Q&A compared to Yahoo!Answers. In social net-works, as users are connected by their interests and socialrelationships, they are more willing to interact with eachother, resulting in a low response delay. We can also seefrom the figure that in Twitter-AT, more than 60 percent ofthe questions are responded within 15 minutes, the percent-age of which is higher than tweets without @. This isbecause when a user is mentioned in a tweet, s(he) is morelikely to respond as s(he) knows that the sender is expectingthe reply from him/her. Considering the social tie betweenthem, the receiver is very likely to respond the tweet in ashort time.

Fig. 1. Askers’ satisfaction on the answers.

Fig. 2. Number of replies for questions.

Fig. 3. CDF of average response time.

1068 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 25, NO. 4, APRIL 2014

Page 4: ME/Mtech projects

Summary. The figures show that questions posted inonline Q&A sites are likely to receive few responses withlong delay, though they are a good channel to inquire infor-mation. Similar result is also found in [17], which showsthat the latency for receiving a satisfying answer in anonline Q&A site is high with the average equals 2:52:30(hh:mm:ss) even when the number of the registered users isvery large (290,000). This is because anonymous users in aQ&A site do not have social relationship between eachother, so they may not have incentives to answer others’questions. By leveraging the close social relationship andinterest similarity properties among friends in social net-works, social-based Q&A systems can help to overcome theinherent problems in online Q&A sites with high responserate and low response delay, since people with similar inter-ests or close social relationship are likely to interact witheach other, especially when a user specifically sends a tweetto another user.

3 SYSTEM DESIGN

3.1 Question Routing

SOS incorporates an online social network, where nodes con-nect each other by their social links. As shown in Fig. 4, a reg-istration server is responsible for user registration. Each userhas an interest ID, which represents his/her interest. Userssharing more common interests with an asker’s question aremore likely to be able to answer the question. Also, userswho have been willing to answer questions and providedhigh-quality answers (measured by answer quality) to nodei’s questions previously are more likely to be willing toanswer node i’s questions and provide high-quality answers.Thus, SOS has a metric best answerer (BAðqi;jÞ) that measuresthe likelihood of node j to be able and willing to answernode i’s question qi with a high-quality answer. It is deter-mined by the interest similarity (Sðqi;jÞ) between the questionqi’s interest and node j’s interest as well as the answer quality(Qði;jÞ) of node j to node i’s previous questions.

In SOS, a node sends or forwards a question to K friendsin its friend list. Each question can be forwarded for TTLhops. A larger TTL increases the probability to find ananswerer in the distributed manner but generates higherforwarding overhead. Determining an appropriate TTL is anon-trivial problem and we leave it as our future work.

Fig. 4 shows the question routing process in SOS. In thefigure, each user is associated with an interest ID. Asker Aforwards its question to the top K friends (nodes B and C)who have the highest best answerer ba values in its friend listwith the question. A question receiver replies to A if it hasan answer. Otherwise, it forwards the question to its top Kfriends in its own friend list in the same manner (B to D)and reduces TTL by 1. The question is forwarded alongnode social links until TTL ¼ 0. If the question initiator hasnot received an answer after delay corresponding to TTL, itsends the question to the server that holds a discussionboard, which can be accessed by all users in the system. Thediscussion board serves as a store for unsolved questions inthe distributed system. Then, the questions in the discussionboard are handled as in online Q&A systems. From this pro-cess, we can see that three problems need to be resolved.

� How to derive the topics (i.e., interests) of a questionor a user (Section 3.2)?

� How to infer additional interests from a questionand a user for more accurate answerer identification(Section 3.3)? For example, from “Tom is a male CSstudent who likes reading book,” we can infer “Tomlikes fiction” so that he can be identified as theanswerer for the question “who is the author of TheLord of the Rings?” Without inference, Tom may notbe identified as an answerer for the question.

� How to locate K friends in the friend list by consid-ering both interest similarity and answer quality(Section 3.4)?

Below, we introduce the solutions for these problems.

3.2 Question/User Interest Representation

When a user first uses the SOS system, s(he) is required tocomplete his/her social profile such as interests, profes-sional background and so on. Based on the social infor-mation, the registration server recommends friends to theuser, and the user then adds friends into his/her friendlist. Fig. 5 shows a simple example of social network,where users A, B and C are connected with each other bytheir social relationships. Each user locally stores her/hisown profile and interest ID, and her/his friend list andtheir interest IDs and answer quality values. Each user cal-culates his/her own interest ID based on his/her socialinformation and sends it to his/her friends. To calculateinterest ID, as shown on the right part of Fig. 6, a node

Fig. 4. Querying process in SOS.

Fig. 5. An example of a node’s social network.

SHEN ET AL.: SOS: A DISTRIBUTED MOBILE Q&A SYSTEM BASED ON SOCIAL NETWORKS 1069

Page 5: ME/Mtech projects

first derives the first-order logic representation (FOL) [31]from its social information, then conducts first-order logicinference to infer its interests, from which it decides theinterest ID.

The left part of Fig. 6 shows the local answerer selec-tion process for forwarding a question in one mobilenode in the SOS system. To parse a question, the nodefirst processes the question using natural language proc-essing (NLP), and then represents the question in theFOL format and uses the FOL inference to infer thequestion’s interests. Finally, it transforms the question toa question ID in the form of a numerical string. Afternode i parses its initiated question qi to a question ID, itcalculates interest similarity Sðqi;jÞ for each of its friendsj 2 F i, where F i denotes the set of node i’s friends. Itthen calculates the best answerer value (BAðqi; jÞ) for eachfriend j by combining Sðqi;jÞ and answer quality fromfriend j (Qði;jÞ). We will explain the details of calculationin Section 3.4. Finally, node i chooses top K friends thathave the highest BAðqi;jÞ values to send the question.

For instance, an asker may ask a question “Where is thebest place to watch the movie Avatar in Clemson?”. The cor-responding keyword list of this question is resolved to theFOL format [where, place, movie, Avatar, Clemson] afternatural language processing. After the FOL inference, theFOL format is changed to [movie(sci-fi), director(JamesCameron), place(Clemson)], which is subsequently encodedas a numerical string such as 3200001000. Similarly, a stu-dent in Clemson University who likes to watch sci-fi movieis represented as [movie(sci-fi), career(student), place(clem-son)] after the FOL inference and be further encoded asinterest ID 3202001001. By comparing the similaritybetween a question’s ID and its friend’s interest ID, a nodecan identify its friends that are able to answer questions.More details of the parsing process for a question or for auser is demonstrated in Figs. 7 and 8, respectively. The fig-ures list the three steps in the process: FOL representation,FOL inference, and ID transformation. Below, we introducethe details of the three steps.

3.2.1 Preliminary of the First-Order Logic

FOL is a powerful tool to describe objects and their relation-ships in real life. FOL has basic rules or axioms, which serveas the base of the inference. For example, the FOL for an

axiom in natural language “All computer science (CS) malestudents who like reading like sci-fi movies” is

ð8x; yÞðCS(x) ^male(x) ^Activity(Reading)) like(y)Þ;

where “CS(x)”, “male(x)”, “Activity(Reading)”, and “like(Sci-Fi)” are predicate symbols, and ^ is connectives symbol. In anFOL representation, connectives symbols (e.g., �, ^) andquantifiers logically connect constant symbols, predicate sym-bols which map from individuals to truth values (e.g., green(Grass)) and function symbols which map individuals to indi-viduals (e.g., father-of(Mary)=John). These symbols repre-sent objects (e.g., people, houses, numbers), relations (e.g, red,is inside) and functions (e.g., father of, best friend),respectively.

3.2.2 First-Order Logic Representation

A question or user profile information is always expressedin the natural language. To convert a question or profileinformation into a format that a computer can understand,we can use part-of-speech tagging [32] or modern NLP tech-niques [33] to divide the question into a group of relatedwords expressed by words, 2-word phrases, the wh-type(e.g., “what”, “where” or “when”). Then we transformquestions into the FOL representation. First, we parse thenatural language into token keywords, which are the con-stant symbols in the FOL representations. The step 1 inFig. 7 shows an example of FOL representation of the query.The keywords of the question “Where is the best cinema inlocation A?” are “cinema” and “location A”.

Each node also transforms its social information intothe FOL representation. Specifically, a node first repre-sents its profile in the form of name:value pairs such as“movies: Avatar, The Social Network”, “music: Hey,

Fig. 6. Answerer selection for forwarding a question in one node.

Fig. 7. An example of FOL inference for a question.

Fig. 8. An example of FOL inference for a user’s social information.

1070 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 25, NO. 4, APRIL 2014

Page 6: ME/Mtech projects

Jude”. That is, each interest is indexed by a uniquename (e.g., movie, music) and can have several values.The syntax name:value is then transformed to the FOLrepresentation expressed by predicate symbols name(-value). For example, the FOL representation of theprevious example is “movie(Avatar)”,“movie(The SocialNetwork)”, “music(Hey, Jude)”. The first step in Fig. 8shows an example of FOL representation of a node’sprofile. Tom has profile information in the name:valuepair format. His favorite books are A1, A2, A3. This infor-mation is transformed into FOL representationFa bkðA1Þ, Fa bkðA2Þ and Fa bkðA3Þ.

3.3 First-Order Logic Inference

As shown in the step 2 in Figs. 7 and 8, the FOL inferencecomponent consists of three parts: (1) fuzzy database, (2)rules and axioms, (3) inference engine. The goal of the infer-ence is to identify node interests represented by a numericalstring that can accurately represent the capability of a nodeto answer questions. The fuzzy database is used to storewords that have relationships, including subset, alias(x),related, with the information in profiles. For example, Rela-ted(cinema) ¼ movie, Subset(computer science, algorithm),Alias(USA) ¼ US.

The rule and axioms provide basic formulas for the infer-ence. For example, given a rule “Major(x)^Subset(x) ¼ y)expert(y)”, if Major(x1) exists in a user’s FOL inference andSubset(x1) ¼ y1 exists in the fuzzy database, then we caninfer that the user is an expert of y1. If Tom majors in com-puter science, and Algorithm is a subset of computer sci-ence, then Tom should be good at answering questionson Algorithm.

The inference engine checks the rules and finds relatedbut not obvious information. It sets each interest as aninference goal and builds lattice inference structure, asshown in Fig. 9, to connect all the FOL symbols with thegoals. Each node in the lattice is an FOL syntax symboland the arrows represent the connective symbols that con-nect the symbols. By mapping the syntax symbols shownin the question (or social information) and fuzzy database,we can trace up from the basic elements to the final goal.For example, as shown by the gray box in Fig. 9, syntaxsymbols CS(Tom), Male(Tom) and Activity(Reading) allpoint to the goal Sci-Fi using an arrow. That is, if thesethree syntax symbols are all satisfied, we can infer thatthe goal Sci-Fi is satisfied for Tom, i.e., Tom can answerthe question about Sci-Fi. As shown in Figs. 7 and 8,after the elements pass the inference engine, the previousFOL representation is transformed to the interest tablelisting the interests of the question (or user). As shown inthe step 3 in these two figures, the interests of a question

(or user) are transformed to a numerical string to repre-sent the knowledge field of a question (or a user).

Next, we introduce how to generate question ID andinterest ID in the form of a numerical string. Fig. 10 showsan example of interest arrays. The top line lists all interestcategories in the system. The different interests in a categoryare alphabetically sorted in the category’s column. Eachinterest in a category is represented by its entry index. Forexample, “Comedy” is represented by two. Each numericalstring for a question (or user) has a series of digits (e.g.,322023150). The position of each digit corresponds to eachcategory and the value of the digit represents an interest inthe category. For example, string 322023150 denotes “Sci-Fi,Pop, Math, Lady Gaga. . .” When generating the numericalstring, each category in the table is checked. For each cate-gory, if a user/question has an interest in the category, theentry index of the interest is the digit in the correspondingposition in the numeric string. Otherwise, the category’sposition in numeric string is set to 0. If the user/question’sinterest can match any specific interest in a category, thecorresponding digit is set to1. If the user/question has sev-eral interests in a category, we use “j” to concatenate theindices of the interests in the category. For example, string1j2j322023150 denotes “ActionjComedyjSci-Fi, Pop, Math,Lady Gaga. . .” based on Fig. 10. From this design, we cansee that if two sets of inferred interests (from questions orusers) are similar to each other, they will have similar (ques-tion or interest) IDs.

The users periodically update metadata containing theirquestions and answering statistics to the registration server.Based on these metadata, the rules and axioms can be addedand updated to reflect the most current behaviors of theusers in the system. These updated rules and axioms arethen remotely configured into the mobile phones of theusers. We will use other techniques [34] to improve the scal-ability of the rule-based inference for a large-scale knowl-edge base in our future work.

3.4 Similarity Value Calculation

After users’ social information and questions are trans-formed into numerical strings, the similarity between a userand a question can be calculated based on two parts: interestsimilarity between the user and question, and answer qual-ity between the question sender and receiver.

3.4.1 Interest Similarity Calculation

To evaluate the interest similarity of a question of user i(qi) and a user j, we use a method proposed in [35]. Weuse IDqi and IDj to denote the interest strings of questionqi and user j, respectively. We use nðqi;jÞ to denote thenumber of interests owned by IDqi but not by IDj; uselðqi;jÞ to denote the number of categories of interest

Fig. 9. Lattice in an inference engine.Fig. 10. An example of interest arrays.

SHEN ET AL.: SOS: A DISTRIBUTED MOBILE Q&A SYSTEM BASED ON SOCIAL NETWORKS 1071

Page 7: ME/Mtech projects

elements owned both by IDqi and IDj, and mðqi;jÞ thenumber of categories of interest elements owned by IDj

but not by IDqi . Then the interest similarity of question qiand user j is defined as

Sðqi;jÞ ¼lðqi;jÞ þ 1

2

1

lðqi;jÞ þ nðqi;jÞ þ 2þ 1

lðqi;jÞ þmðqi;jÞ þ 2

� �:

(1)

The value of Sðqi;jÞ ranges in the classical spectrum ½0; 1�,and it represents the level of likelihood that two stringsunder comparison are actually similar. If two strings havecomplete overlapping ðn ¼ m ¼ 0Þ, Sðqi;jÞ approaches 1 asthe number of common features grows. The underlyingidea of Equation (1) is that two strings with longer completeoverlapping should have higher similarity than the twostrings with less complete overlapping. In the case of nooverlapping ðl ¼ 0Þ, the function approaches to 0 as long asthe number of non-shared entries grows. It indicates thattwo strings with a larger number of entries and share nocommon entries are more likely to have smaller similaritythan the two strings with a smaller number of entries andshare no common entries.

3.4.2 Answer Quality Calculation

Social closeness directly affects the willingness of people toanswer or forward questions. Several recent works havestudied how to effectively calculate the social closenessbetween two users [36], [37]. However, these social close-ness value calculation mechanisms are based on the wholesocial network topology, which are energy consuming. It iseven worse when the social network dynamically changes.Therefore, the topology based social closeness calculationmethods are not suitable for energy-stringent mobile devi-ces in SOS. To reduce the load on the mobile devices, eachuser in SOS locally manages its first hand information onthe answer quality of each of his/her friends. As the perfor-mance of the SOS largely depends on the activeness and theknowledge base of the users, user i considers the number ofreceived answers from user j and their associated qualityratings when calculating the answer quality of user j. Wecall it as the feedback mechanism. Specifically, SOS initiallylets users indicate the answer quality value of a newlyadded friend. For each received answer, an asker can ratethe quality of the answer within rating scale R ¼ ½1; 5�. Theanswer quality value is periodically updated based on thenumber of answers received from friend j during eachperiod T and the associated quality rating (r 2 ½1; 5�). Forthe kth question sent from node i to node j, if node ireceives an answer from node j during T , xk ¼ 1; otherwise,xk ¼ 0. The parameter xk is used to represent the willing-ness of node j to answer questions from node i. Then, theanswer quality Qði;jÞ is calculate by:

Qði;jÞ ¼ a �Qði;jÞ þ ð1� aÞ �Xk

ðxk � rk=RÞðxk ¼ 0; 1Þ; (2)

where a 2 ½0; 1� is a damping factor, rk is node i’s quality rat-ing for the kth answer received from node j. A larger Qði;jÞimplies that user j is willing and able to provide high-qualityanswers to user i.

Considering the high dynamism of the social networks,in which the willingness of users to answer questions andthe quality of answers from a user to another user maychange over time, we add a damping factor a into theanswer quality calculation. In this way, the answer qualitybetween two nodes will be sensitive to parameters r and x.That is, a node can quickly notice the active answerers whowere inactive before and identify them as potentialanswerer to its questions.

3.4.3 Best Answerer Metric Calculation

Based on Sections 3.4.1 and 3.4.2, for its generated orreceived question qi that it cannot answer, node i calculatesthe best answerer metric of each of its friends. That is,

BAðqi;jÞ ¼ bSðqi;jÞ þ ð1� bÞQði;jÞ; (3)

where b 2 ½0; 1Þ is a parameter used to adjust the weight ofthe interest similarity and answer quality. Node i thenselects the top K friends that have the highest BAðqi;jÞ val-ues and forwards the question to them. We confine thequestion forwarding TTL to three since the social trustbetween two nodes decrease exponentially with distance.This relationship has been confirmed by other studies [38],[39]. Binzel and Fehr [38] discovered that a reduction insocial distance between two persons significantly increasesthe trust between them. Swamynathan et al. [39] foundthat people normally conduct e-commerce business withpeople within 2-3 hops in their social network.

Algorithm 1 shows the pseudocode of the process for thebest answerer metric calculation and best answerer selectionconducted by node i. If node i does not receive answers forits created question during the time corresponding to TTL, itresorts to the centralized server for the answers, whereall users conduct Q&A activities as in online Q&A sites.Lines 4-6 are used to periodically update answer quality ofeach of its friends. Lines 8-13 calculate each friend’s bestanswerer metric and generate a list including all metric

1072 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 25, NO. 4, APRIL 2014

Page 8: ME/Mtech projects

values. Lines 14-17 identify the top-K friends with the high-est best answerer metric values and send the question tothem. Answer quality Qði;jÞ is pre-processed, and only inter-est similarity Sðqi;jÞ needs to be calculated at run time. TheSðqi;jÞ calculation has a time complexity of OðjF ijÞ. As thenumber of keywords in a question is generally very small,the calculation of Sðqi;jÞ should take a short time and costs lit-tle computation resources of the mobile devices. This top-Kfriend selection algorithm has a time complexity ofOðjF ijÞ.

4 PERFORMANCE EVALUATION

We evaluated the SOS system using our crawled questionsfrom Yahoo!Answers. Since Yahoo!Answers does not haveuser profile information, we crawled 1;000 users from Face-book to form a social network. We used one user as a seedand used breadth-first search to crawl their personal profileinformation. We ignored users that did not fill out their pro-files. The crawling stopped when 1,000 users were crawled.The users are highly clustered due to the high clusteringfeature of the social networks.

Users’ profiles contain their current locations, educationbackgrounds, hobbies and interests, such as books, movies,music and television programs. This information wasparsed and conversed to FOL and finally encoded as stringsusing the method introduced previously. In the experiment,we focused on evaluating the questions related to movies,because most of the Facebook users filled out a largeamount of information in the movie section in their profiles.

Since the Facebook data set is separated from theYahoo!Answers data set, we cannot directly tell which per-sons can answer which questions. Therefore, to make theexperiment operable, we focused on the questions in theMovie categories in Yahoo!Answers, where we manuallyselected 100 questions with keywords (e.g., movie type,directors) that can be mapped to a Facebook user’s profileinterests. For each of the 100 questions, we randomlyassigned the answers of the question to the Facebook userswhose profile interests (i.e., action movie, directors) matchmost to the question’s keywords; one answer to one Face-book user. We use the answer quality indicated in theYahoo!Answer as answer ratings in our experiment. Thebest answers have rating 5, the spam answers have rating 1and all other answers have rating 3. The 100 questions werereplicated for 10 times and randomly assigned to the Face-book users to ask. We set b to 0:5 to balance the impact ofinterest similarity and answer quality and set a to 0.3 bydefault. In order to build the correlations between actors,movie companies, directors, we imported movie data inInternet Movie Data Base (IMDB) into the fuzzy database.Thus, when a question is issued for a certain actor A, theinference engine can find the movies and directors that arerelated to actor A. Then, the users who are fond of thesemovies or the directors are considered as the potential ques-tions answerers. As a result, SOS with an inference enginecan find more matching answerers than SOS without aninference engine.

By default, the number of friends K that a user selects forforwarding the question was set to five. The number ofhops in a social network (social hops in short) that a ques-tion is forwarded was set to three. We use RLA to denote

the number of ReLevant Answers to a question existing inthe system (i.e., the number of answers to a question inYahoo!Answers), and RTA to represent the number ofReTrieved Answers in the SOS system for a question. In theexperiments, we focus on the following metrics:

� Precision rate: a measure of exactness of the returnedresults: ðjRLA

TRTAjÞ=jRTAj.

� Recall rate: a measure of completeness of the returnedresults: jRLA

TRTAj=jRLAj.

� Overhead: the number of messages transmitted in thesystem during the entire simulation process.

� Delay: the time duration between a query is issuedand the first answer is received.

In our evaluation, the users returned by the differentapproaches are judged correct if the user has the correctanswer to the question as in the Q&A data set from theYahoo Q&A trace. The quality of the answers is also shownin the trace data. We use SOS-w/-infer to denote SOS withthe FOL inference engine, and use SOS-w/o-infer to denoteSOS without the FOL inference engine. Unless otherwisespecified, we do not have feedback mechanism in SOS andthe answer quality is always the initial value 1.

We compare the routing performance of SOS with theFlooding and Random walk routing methods. We useFlooding to represent the broadcasting-based social-basedQ&A systems [17], [18], [19], in which a node forwards aquestion to all of its friends in the social network. In Ran-dom walk, a node forwards a question to K randomlyselected friends. In both methods, a node stops forwardinga question when it returns an answer. Random walk canmimic the Q&A behavior pattern of a user in the onlineQ&A sites, in which the question is randomly visited by dif-ferent users until an answer is received. In this section, weassume that all the nodes in the system are willing toanswering the questions if they are able to.

Fig. 11 shows the comparison results of the precisionrates of the four systems. We varied the number of selectedfriends K for each node from five to 20 with five increase ineach step. We see that SOS has the highest query precisionrate because SOS can accurately identify the potentialanswerers based on their social information and relation-ship to the question and the asker. SOS-w/-infer outper-forms SOS-w/o-infer as SOS-w/-infer has much moreparsed information. In Flooding, a node forwards a ques-tion to all of its friends, so its precision rate is relative low.In Random walk, the queries are randomly sent to users, soit generates low precision rate. It is interesting to find thatthe precision rate decreases in SOS-w/o-infer and SOS-w/-infer as K increases while it remains constant in Flooding

Fig. 11. Query precision rate versusK:

SHEN ET AL.: SOS: A DISTRIBUTED MOBILE Q&A SYSTEM BASED ON SOCIAL NETWORKS 1073

Page 9: ME/Mtech projects

and Random walk. This is because some questions havevery few users who are able to answer them. If such a ques-tion is forwarded to more neighbors, more users that cannotanswer the question will receive the questions. As Kincreases, Random walk performs similar to Flooding asmore users that are unable to answer a question receivethe question.

Fig. 12 shows the comparison results of the recall ratesof the four systems. Since Flooding forwards a questionto all the neighbors of a node in the social network, it pro-duces the highest recall rate. As K increases, the recallrate of SOS and Random walk increases. This is becauseas more users receive questions, more relevant users canreceive questions. Although SOS can find relevantanswerers with high precision, it may not be able to iden-tify all of them, thus generating lower recall than Flood-ing. Again, SOS-w/-infer outperforms SOS-w/o-inferslightly because of its higher interest derivation ability.As a user randomly approaches others for answers inRandom walk, its recall rate is lower than SOS.

To evaluate the scalability performance, Fig. 13 showsoverhead measured by the number of queries with differ-ent network sizes. For the network with size N , weselected the first N crawled users in the data crawling toensure that the evaluated social network of these N usersmaintains the same topology as in Facebook. Flooding for-wards a question to all the neighbors of an asker in thesocial network, leading to a high overhead, although it canimprove its recall rate significantly. Random walk haslower overhead than Flooding, but it still has much higheroverhead than SOS as it needs to keep forwarding to ran-domly chosen nodes until finding an answerer. With highprecision, SOS only needs to forward questions to only afew users, thus producing much lower overhead thanFlooding and Random walk. We also see that as the net-work size increases, the overhead in Flooding and Randomwalk increases, but the overhead in SOS remains constant.In SOS, most potential answerers can be accurately located,

so a larger network size does not increase the number ofquestion forwarding hops.

Fig. 14 shows the comparison results of average questionresponse delay versus network size of the four systems. Weuse the medium response delay of 12 min in Twitter asshown in Fig. 3 as the response delay in each hop in thesocial network. Since Flooding forwards a questions to allthe neighbors in the social network, any neighbor with theknowledge of the question can answer the question immedi-ately. Therefore, it generates the lowest response delay. AsSOS-w/o-infer and SOS-w/-infer can accurately identifythe potential answerers, their average reply delay is compa-rable to Flooding. Since SOS-w/-infer generates much moreinferred information than SOS-w/o-infer for potentialanswerer selection, SOS-w/-infer produces much lessresponse delay than SOS-w/o-infer. In Random walk, ran-domly selected friends have low probability to answer thequestion, thus a node needs to search more nodes to reachan answerer, leading to relatively high delay. We also seethat as the network size increases, the delay of Randomwalk increases slightly and the delay of SOS and Floodingremains constant due to the same reasons as in Fig. 13.Though Flooding provides constant response delay, itcomes at a high cost of flooding overhead, while SOS gener-ates considerably lower overhead.

We then test the impact of question forwarding distanceon the Q&A performance. Fig. 15 shows the recall rates ofSOS-w/o-infer and SOS-w/-infer versus the different num-ber of social hops that a question is forwarded. We see thatthe recall rate of both systems increases as the number ofsocial hops increases. This is because the increased socialhops cause more users to receive a question, leading tomore received relevant answers. Fig. 16 shows the compari-son results of precision rates of the systems versus the dif-ferent number of social hops that a question is forwarded.We see that 80 percent of the answers can be retrieved fromthe direct friends. This is because in a social network,friends with a close social relationship are closely clustered

Fig. 12. Query recall rate versusK:

Fig. 13. Communication overhead.

Fig. 14. Communication delay.

Fig. 15. Query recall with the number of social hops.

1074 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 25, NO. 4, APRIL 2014

Page 10: ME/Mtech projects

and are likely to have common interests. SOS-w/-infer hasboth a higher recall and precision rate than SOS-w/o-inferbecause SOS-w/-infer can identify the potential answererswith higher accuracy with its inferred information fromusers and questions.

5 CONCLUSION

In this paper, we present the design and implementation ofa distributed Social-based mObile Q&A System (SOS). SOSis novel in that it achieves lightweight distributed answerersearch, while still enables a node to accurately identify itsfriends that can answer a question. SOS uses the FOL repre-sentation and inference engine to derive the interests ofquestions, and interests of users based on user social infor-mation. A node considers both its friend’s parsed interestsand answer quality in determining the friend’s similarityvalue, which measures both the capability and willingnessof the friend to answer/forward a question. Compared tothe centralized social network based Q&A systems that suf-fer from traffic congestions and high server bandwidth cost,SOS is a fully distributed system in which each node makeslocal decision on question forwarding. Compared to broad-casting, SOS generates much less overhead with its limitedquestion forwarding hops. Since each user belongs to sev-eral social clusters, by locally selecting most potentialanswerers, the question is very likely to be forwarded toanswerers that can provide answers. The low computationcost makes the system suitable for low-end mobile devices.We conducted extensive trace-driven simulations andimplemented the system on iPod Touch/iPhone mobiledevices. The results show that SOS can accurately identifyanswerers that are able to answer questions. Also, SOSearns high user satisfaction ratings on answering both fac-tual and non-factual questions. In the future, we will studythe combination of SOS and cloud-based Q&A system. Wewill also release the application in the App Store and studythe Q&A behaviors of users in a larger-scale social network.

ACKNOWLEDGMENTS

This research was supported in part by US NSF GrantsCNS-1254006, CNS-1249603, CNS-1049947, CNS-1156875,CNS-0917056 and CNS-1057530, CNS-1025652, CNS-0938189, CSR-2008826, CSR-2008827, Microsoft ResearchFaculty Fellowship 8300751, and US Department of Ener-gy’s Oak Ridge National Laboratory including the ExtremeScale Systems Center located at ORNL and DoD4000111689. An early version of this work was presented inthe Proc. of ICDCS’12 [29].

REFERENCES

[1] Google, http://www.google.com, 2013.[2] Bing, http://www.bing.com, 2013.[3] B.M. Evans and E.H. Chi, “An Elaborated Model of Social

Search,” J. Information Processing and Management, vol. 46,pp. 656-678, 2009.

[4] L. Terveen, W. Hill, B. Amento, D. McDonald, and J. Creter,“PHOAKS: A System for Sharing Recommendations,” Comm.ACM, vol. 40, pp. 59-62, 1997.

[5] M.S. Ackerman, “Augmenting Organizational Memory: A FieldStudy of Answer Garden,” ACM Trans. Information Systems,vol. 16, pp. 203-224, 1998.

[6] L.G. Terveen, P.G. Selfridge, and M.D. Long, “Living DesignMemory: Framework, Implementation, Lessons Learned,”Human-Computer Interaction, vol. 10, pp. 1-37, 1995.

[7] E. Amitay, D. Carmel, N. Har’El, S. Ofek Koifman, A. Soffer,S. Yogev, and N. Golbandi, “Social Search and DiscoveryUsing a Unified Approach,” Proc. 20th ACM Conf. Hypertextand Hypermedia (HT ’09), 2009.

[8] D. Carmel, N. Zwerdling, I. Guy, S. Ofek-Koifman, N. Har’el, I.Ronen, E. Uziel, S. Yogev, and S. Chernov, “Personalized SocialSearch Based on the User’s Social Network,” Proc. 18th ACM Conf.Information and Knowledge Management (CIKM ’09), 2009.

[9] S. Kolay and A. Dasdan, “The Value of Socially Tagged URLs for aSearch Engine,” Proc. 18th Int’l Conf. World Wide Web (WWW ’09),2009.

[10] S. Bao, G. Xue, X. Wu, Y. Yu, B. Fei, and Z. Su, “Optimizing WebSearch Using Social Annotations,” Proc. 16th Int’l Conf. World WideWeb (WWW ’07), 2007.

[11] H.H. Chen, L. Gou, X. Zhang, and C.L. Giles, “CollabSeer: ASearch Engine for Collaboration Discovery,” Proc. 11th Ann. Int’lACM/IEEE Joint Conf. Digital Libraries (JCDL ’11), 2011.

[12] C.Y. Lin, N. Cao, S.X. Liu, S. Papadimitriou, J. Sun, and X. Yan,“Smallblue: Social Network Analysis for Expertise Search andCollective Intelligence,” Proc. IEEE 25th Int’l Conf. In Data Engi-neering (ICDE ’09), 2009.

[13] H. Kautz, B. Selman, and M. Shah, “Referral Web: CombiningSocial Networks and Collaborative Filtering,” Comm. ACM,vol. 40, pp. 63-65, 1997.

[14] D.W. McDonald and M.S. Ackerman, “Expertise Recommender:A Flexible Recommendation System and Architecture,” Proc.ACM Conf. Computer Supported Cooperative Work (CSCW ’00), 2000.

[15] Yahoo answer, http://answers.yahoo.com, 2013.[16] Ask.com., http://www.ask.com, 2013.[17] F. Harper, D. Raban, S. Rafaeli, and J. Konstan, “Predictors of

Answer Quality in Online Q&A Sites,” Proc. SIGCHI Conf. HumanFactors in Computing Systems (SIGCHI ’08), 2008.

[18] M.R. Morris, J. Teevan, and K. Panovich, “What Do People AskTheir Social Networks, and Why?: A Survey Study of Status Mes-sage Q&A Behavior,” Proc. SIGCHI Conf. Human Factors in Com-puting Systems (CHI ’10), 2010.

[19] J. Teevan, M.R. Morris, and K. Panovich, “Factors AffectingResponse Quantity, Quality, and Speed for Questions Asked viaSocial Network Status Messages,” Proc. ICWSM, 2011.

[20] R.W. White, M. Richardson, and Y. Liu, “Effects of CommunitySize and Contact Rate in Synchronous Social Q&A,” Proc. SIGCHIConf. Human Factors in Computing Systems, 2011.

[21] M. Richardson and R.W. White, “Supporting Synchronous Socialq&a Throughout the Question Lifecycle,” Proc. 20th Int’l Conf.World Wide Web (WWW ’11), 2011.

[22] D. Horowitz and S.D. Kamvar, “The Anatomy of a Large-ScaleSocial Search Engine,” Proc. 19th Int’l Conf. World Wide Web(WWW ’10), 2010.

[23] Twitter, http://twitter.com/, 2013.[24] Mobile internet stats roundup, http://econsultancy.com/us/

blog, 2013.[25] Facebook may be growing too fast, http://techcrunch.com/, 2013.[26] A. Mtibaa, M. May, C. Diot, and M. Ammar, “Peoplerank: Social

Opportunistic Forwarding,” Proc. IEEE INFOCOM, 2010.[27] J. Raacke and J. Bonds-Raacke, “MySpace and Facebook: Apply-

ing the Uses and Gratifications Theory to Exploring Friend-Net-working Sites,” CyberPsychology & Behavior, vol. 11, p. 169, 2008.

[28] M.R. Morris, J. Teevan, and K. Panovich, “A Comparison of Infor-mation Seeking Using Search Engines and Social Networks,” Proc.Fourth Int’l AAAI Conf. Weblogs and Social Media (ICSWM ’10),2010.

Fig. 16. Query precision with the number of social hops.

SHEN ET AL.: SOS: A DISTRIBUTED MOBILE Q&A SYSTEM BASED ON SOCIAL NETWORKS 1075

Page 11: ME/Mtech projects

[29] Z. Li, H. Shen, G. Liu, and J. Li, “SOS: A Distributed Context-Aware Question Answering System Based on SocialNetworks,” Proc. IEEE 32nd Int’l Conf. Distributed ComputingSystems (ICDCS ’12), 2012.

[30] Readwriteweb, http://twitter.com/#/RWW, 2013.[31] R.M. Smullyan, First-Order Logic. Dover Publications, 1995.[32] K. Toutanova and C.D. Manning, “Enriching the Knowledge

Sources Used in a Maximum Entropy Part-of-Speech Tagger,”Proc. Joint SIGDAT Conf. Empirical Methods in Natural LanguageProcessing and Very Large Corpora (SIGDAT ’00), 2000.

[33] C. Manning and H. Schuetze, Foundations of Statistical NaturalLanguage Processing. The MIT Press, June 1999.

[34] N. Lao, T. Mitchell, and W.W. Cohen, “Random Walk Inferenceand Learning in a Large Scale Knowledge Base,” Proc. Conf. Empir-ical Methods in Natural Language Processing (EMNLP ’11), 2011.

[35] M. Kirsten and S. Wrobel, “Extending K-Means Clustering toFirst-Order Representations,” Proc. Inductive Logic Programming10th Int’l Conf. (ICILP ’00), 2000.

[36] Z. Li, H. Shen, and K. Sapra, “Leveraging Social Networks toCombat Collusion in Reputation Systems for Peer-to-PeerNetworks,” Proc. IEEE Int’l Parallel and Distributed ProcessingSymp. (IPDPS ’11), 2011.

[37] Z. Li and H. Shen, “SOAP: A Social Network Aided Personalizedand Effective Spam Filter to Clean Your E-mail box,” Proc. IEEEINFOCOM, 2011.

[38] C. Binzel and D. Fehr, “How Social Distance Affects Trust andCooperation: Experimental Evidence in a Slum,” Proc. ERF 16thAnnual Conference: Shocks, Vulnerability and Therapy, 2009.

[39] E-commerce, http://en.wikipedia.org/wiki/E-commerce, 2013.[40] S. Chakrabarti, “Dynamic Personalized Pagerank in Entity-

Relation Graphs,” Proc. 16th Int’l Conf. World Wide Web(WWW ’07), 2007.

[41] M.F. Bulut, Y.S. Yilmaz, and M. Demirbas, “CrowdsourcingLocation-Based Queries,” Proc. IEEE Int’l Conf. Pervasive Com-puting and Comm. Workshops (PERCOMW ’11), 2011.

[42] B.M. Evans, S. Kairam, and P. Pirolli, “Exploring the CognitiveConsequences of Social Search,” Proc. CHI ’09 Extended Abstractson Human Factors in Computing Systems (CHI EA ’09), 2009.

[43] B. Hecht, J. Teevan, M.R. Morris, and D. Liebling, “Search Bud-dies: Bringing Search Engines into the Conversation,” Proc.ICWSM, 2012.

[44] C. Lampe, J. Vitak, R. Gray, and N.B. Ellison, “Perceptions of Face-book’s Value as an Information Source,” Proc. SIGCHI Conf.Human Factors in Computing Systems (CHI ’12), 2012.

[45] O. Escalante, T. P�erez, J. Solano, and I. Stojmenovic, “Sparse Struc-tures for Searching and Broadcasting Algorithms over InternetGraphs and Peer-to-Peer Computing Systems,” Peer-to-Peer Net-working and Applications, 2011.

[46] S. Ioannidis and P. Marbach, “On the Design of Hybrid Peer-to-Peer Systems,” Proc. ACM SIGMETRICS Int’l Conf. Measurementand Modeling of Computer Systems (SIGMETRICS ’08), 2008.

[47] R. Ahmed and R. Boutaba, “Plexus: A Scalable Peer-to-PeerProtocol Enabling Efficient Subset Search,” IEEE/ACM Trans.Networking (TON), vol. 17, no. 1, pp. 130-143, Feb. 2009.

[48] M. Yang and Y. Yang, “An Efficient Hybrid Peer-to-Peer Systemfor Distributed Data Sharing,” IEEE Trans. Computers, vol. 59,no. 9, pp. 1158-1171, Sept. 2010.

[49] H.-W. Ferng and I. Christanto, “A Globally Overlaid Hierarchicalp2p-Sip Architecture with Route Optimization,” IEEE Trans. Paral-lel and Distributed Systems, vol. 22, no. 11, pp. 1826-1833, Nov. 2011.

[50] H. Shen and K. Hwang, “Locality-Preserving Clustering and Dis-covery of Resources in Wide-Area Distributed ComputationalGrids,” IEEE Trans. Computers, vol. 61, no. 4, pp. 458-473, Apr.2012.

[51] H. Shen and C. Xu, “Hash-Based Proximity Clustering for Effi-cient Load Balancing in Heterogeneous DHT Networks,” J. Paral-lel and Distributed Computing, vol. 68, no. 5, pp. 686-702, May 2008.

[52] H. Shen and C. Xu, “Locality-Aware and Churn-Resilient LoadBalancing Algorithms in Structured Peer-to-Peer Networks,”IEEE Trans. Parallel and Distributed Systems, vol. 18, no. 6, pp. 849-862, June 2007.

[53] H. Chandler, H. Shen, L. Zhao, J. Stokes, and J. Li, “Toward P2P-Based Multimedia Sharing in User Generated Contents,” IEEETrans. Parallel and Distributed Systems, vol. 23, no. 5, pp. 966-975,May 2012.

[54] A. Iamnitchi, M. Ripeanu, E. Santos-Neto, and I. Foster, “TheSmall World of File Sharing,” IEEE Trans. Parallel and DistributedSystems, vol. 22, no. 7, pp. 1120-1134, July 2011.

[55] K.C.J. Lin, C.P. Wang, C.F. Chou, and L. Golubchik, “SocioNet: ASocial-Based Multimedia Access System for Unstructured P2PNetworks,” IEEE Trans. Parallel and Distributed Systems, vol. 21,no. 7, pp. 1027-1041, July 2010.

[56] X. Cheng and J. Liu, “NetTube: Exploring Social Networks forPeer-to-Peer Short Video Sharing,” Proc. IEEE INFOCOM, 2009.

[57] H. Shen and C.-Z. Xu, “Leveraging a Compound Graph BasedDHT for Multi-Attribute Range Queries with PerformanceAnalysis,” IEEE Trans. Computers, vol. 61, no. 4, pp. 433-447,Apr. 2012.

[58] F. Lehrieder, S. Oechsner, T. Hossfeld, Z. Despotovic, W. Kellerer,and M. Michel, “Can P2P-Users Benefit from Locality-Awareness?” Proc. IEEE 10th Int’l Conf. Peer-to-Peer Computing,2010.

[59] F. Picconi and L. Massoulie, “ISP Friend or Foe? Making P2P LiveStreaming ISP-Aware,” Proc. 29th IEEE Int’l Conf. Distributed Com-puting Systems (ICDCS ’09), 2009.

[60] D. Ciullo, M.A. Garica, A. Horvath, E. Leonardi, M. Mellia, D.Rossi, M. Telek, and P. Veglia, “Network Awareness of P2P LiveStreaming Applications,” Proc. IEEE Int’l Symp. Parallel and Dis-tributed Processing (IPDPS ’09), 2009.

[61] D.R. Choffnes and F.E. Bustamante, “Taming the Torrent: A Prac-tical Approach to Reducing Cross-ISP Traffic In P2P Systems,”Proc. ACM SIGCOMM, 2008.

[62] S. Seetharaman and M.H. Ammar, “Managing Inter-Domain Traf-fic In the Presence of Bittorrent File-Sharing,” Proc. ACM SIGMET-RICS Int’l Conf. Measurement and Modeling of Computer Systems,2008.

[63] H. Zhang, Z. Shao, M. Chen, and K. Ramchandran, “OptimalNeighbor Selection in BitTorrent-like Peer-to-Peer Networks,”Proc. Joint Int’l Conf. Measurement and Modeling of Computer Systems(SIGMETRICS ’11), 2011.

[64] S. Genaud and C. Rattanapoka, “Large-Scale Experiment Of Co-Allocation Strategies For Peer-To-Peer Supercomputing in P2P-MPI,” Proc. IEEE Int’l Symp. Parallel and Distributed Processing(IPDPS ’08), 2008.

Haiying Shen received the BS degree in com-puter science and engineering from Tongji Uni-versity, China in 2000, and the MS and PhDdegrees in computer engineering from WayneState University in 2004 and 2006, respectively.She is currently an assistant professor in theDepartment of Electrical and Computer Engi-neering at Clemson University. Her researchinterests include distributed computer systemsand computer networks, with an emphasis onP2P and content delivery networks, mobile com-

puting, wireless sensor networks, and cloud computing. She is a Micro-soft faculty fellow of 2010 and a senior member of the IEEE and ACM.

Ze Li received the BS degree in electronics andinformation engineering from Huazhong Univer-sity of Science and Technology, China in 2007,and the PhD degree in computer engineeringfrom Clemson University. His research interestsinclude distributed networks, with an emphasison peer-to-peer and content delivery networks.He is a data scientist in the MicroStrategy Incor-poration. He is a student member of the IEEE.

1076 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 25, NO. 4, APRIL 2014

Page 12: ME/Mtech projects

Guoxin Liu received the BS degree fromBeiHang University, in 2006, and the MS degreefrom the Institute of Software, Chinese Academyof Sciences, in 2009. He is working toward thePhD degree in the Department of Electrical andComputer Engineering of Clemson University.His research interests include distributed net-works, with an emphasis on Peer-to-Peer, datacenter and online social networks. He is a studentmember of the IEEE.

Jin Li received the PhD (with honor) degree fromTsinghua University in 1994. After brief stints atUSC and Sharp Labs, he joined MicrosoftResearch in 1999, first as one of the foundingmembers of Microsoft Research Asia, and thenmoved to Microsoft Research (Redmond, WA) in2001. From 2000, he has served as an affiliatedprofessor in Tsinghua University. He managesthe Compression, Communication and Storagegroup. Blending theory and system, he excels atinterdisciplinary research, and is dedicated to

advance communication and information theory and apply it to practicalsystem building. He is a research manager and principal researcher atMicrosoft Research, Redmond, Washington. His invention has beenintegrated into many Microsoft products. Recently, he and his groupmembers have made key contributions to Microsoft product line (e.g.,RemoteFX for WAN in Windows 8, Primary Data Deduplication inWindows Server 2012, and Local Reconstruction Coding in WindowsAzure Storage), that leads to commercial impact in the order of hundredsof millions of dollars. He was the recipient of Young Investigator Awardfrom Visual Communication and Image Processing’98 (VCIP) in 1998,ICME 2009 Best Paper Award, and USENIX ATC 2012 Best PaperAward. He is/was the associate editor/guest editor of the IEEE Transac-tion on Multimedia, Journal of Selected Area of Communication, Journalof Visual Communication and Image Representation, P2P networkingand applications, and Journal of Communications. He was the generalchair of PV2009, the lead program chair of ICME 2011, and the technicalprogram chair of CCNC 2013. He is a fellow of the IEEE.

" For more information on this or any other computing topic,please visit our Digital Library at www.computer.org/publications/dlib.

SHEN ET AL.: SOS: A DISTRIBUTED MOBILE Q&A SYSTEM BASED ON SOCIAL NETWORKS 1077