

Page 1: [IEEE 2009 ISECS International Colloquium on Computing, Communication, Control, and Management (CCCM) - Sanya, China (2009.08.8-2009.08.9)] 2009 ISECS International Colloquium on Computing,

2009 ISECS International Colloquium on Computing, Communication, Control, and Management

978-1-4244-4246-1/09/$25.00 ©2009 IEEE CCCM 2009

Understanding and Searching the Online Video in China

Dan Guo, Changjia Chen, Yishuai Chen

School of Electrical and Information Engineering, Beijing Jiaotong University, Beijing, China, 100044

[email protected], [email protected], [email protected]

Abstract—With the rapid expansion of broadband Internet access, users now tend to view online video directly in web browsers instead of downloading it to a local machine before watching. We investigate the need for a general video search engine, especially for Internet users in China, and design our indexing strategies and algorithms. We present our investigation results and our design, along with the influence of our search engine, which we assess by studying users' search and click behavior.

Keywords- search engine; internet video; user behavior; long tail

I. INTRODUCTION

With the growing penetration of broadband Internet access [1], online video has become prevalent. Baidu [2] is the dominant search engine in China, with a 70% market share [3]. In 2007, we became aware of the rapid emergence of online video content. Using Baidu's webpage database, we carried out an extensive analysis of the scale and distribution of online video in China and recognized the need for a video search service. We therefore developed and deployed the first generation of our video search service [4]. The service has kept evolving as the volume of online video on Chinese websites has grown. Today, two years later, it has become one of the largest video search services in the Chinese Internet market, and 10-20% of the traffic of the most popular online video websites in China is generated by our video search service [5].

In this paper, we present our measurements of online video in China in 2007 and 2009, followed by the design of our indexing strategies and indexing algorithms. We also report our measurements of users' search and click behavior.

In detail, our contributions include:

1. We present the statistics, categorization, website distribution, and growth rate of online video in China.

2. We present our indexing strategies and indexing algorithm, which considers miscellaneous video-related text information. We support its suitability with measurements.

3. We present our measurement and analysis of users' search and click behavior.

The remainder of the paper is organized as follows. We begin with related work in Section II. In Section III, we present our analysis of online video in 2007 and our indexing strategy. In Section IV, we describe our indexing algorithm and validate it by analyzing user behavior. In Section V, we present our observations of Chinese online video in 2009. Section VI concludes the paper.

II. RELATED WORK

As early as 1997, [6] proposed a crawler and categorization engine to help explore image and video content on the Internet. The proposed crawler checks the file extension of each crawled URL to find videos, and then categorizes each video by analyzing its content and the related text. The videos they analyzed are traditional videos: each is referenced by a URL, and users treat it as an ordinary Internet document that must be downloaded before viewing.

To our knowledge, existing research on online video focuses on single video websites. [7] analyzes the distribution and evolution of video popularity on YouTube [8], and discusses the performance of caching algorithms and P2P methods. [9] measures the content distribution network architecture of three online video websites and compares their performance.

To detect duplicate video content, [10] proposes a content-based method and demonstrates its efficiency in practice through experiments.

III. STATISTICS AND INDEXING STRATEGY

As early as 2005, YouTube-like video sharing websites, e.g. Tudou [11], emerged in China and grew quickly. These websites let users both submit videos and view them online, and also allow a brief description and categorization of each video. The enormous number of Internet users in China, especially those with broadband access, embraced video sharing and viewing over the web because broadband made both far more feasible and simple than before.

Inspired by the success of these online video sharing websites, traditional Yahoo-like portal websites also started including more and more online video in their web pages, such as videos embedded in news reports. Some even created dedicated channels for such video content; e.g. Sina.com, one of the largest portal sites in China, has its own video channel [12].

Since then, online video has entered the mainstream. Considering the number and quality of online videos and users' interest in them, we compiled extensive statistics on the status of online video in China to evaluate the need for a video search service, and carefully designed its indexing strategy. In this section, we present our evaluation results and our indexing strategies for different types of websites.

A. The Status of Online Video in China in 2007

We had the largest database of Chinese web pages, so we naturally used it to obtain the most comprehensive evaluation of the status of online video.

We extracted the online video links from the web page database and removed the duplicated links. By our estimation, there are about 122 million online video links and 1 million traditional video links in total. We categorize the hosting websites into the following five types. Table I lists the number of videos each type hosts and the corresponding percentage.

- Video sharing website: a website like YouTube, e.g. Tudou.

- Portal website: a portal site that provides video content, e.g. Sina, hupo.tv.

- BBS/Blog website: a website where users post videos along with text information.

- Aggregation website: a website that aggregates videos from other websites and provides a unified interface to users.

- Traditional website: a website that offers downloads of traditional videos as defined in Section II.

It should be noted that the estimate may depend partly on the techniques used to mine video content from the Internet and on how duplicates are detected. Nevertheless, we can conclude that the number of videos is large enough to justify a dedicated search service.

TABLE I. VIDEOS HOSTED BY FIVE TYPES OF WEBSITE

Website Type             No. of Videos (m)   Percentage (%)
Video Sharing Website    65                  52.8
Portal Website           18                  14.6
BBS/Blog                 18                  14.6
Aggregation Website      21                  17
Traditional Website      1                   1
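The percentages in Table I can be recomputed from the video counts. The following is a minimal sketch that assumes the percentage column is each count divided by the sum of all five counts; the variable names are illustrative and not part of the original system.

```python
# Video counts in millions, copied from Table I.
counts_m = {
    "Video Sharing Website": 65,
    "Portal Website": 18,
    "BBS/Blog": 18,
    "Aggregation Website": 21,
    "Traditional Website": 1,
}

# 122 million online video links plus 1 million traditional video links.
total_m = sum(counts_m.values())

# Share of each website type as a percentage of all video links
# (assumption: this is how the table's percentage column was derived).
shares = {name: 100.0 * n / total_m for name, n in counts_m.items()}

for name, pct in shares.items():
    print(f"{name:<24}{counts_m[name]:>4} m {pct:6.1f} %")
```

Under this assumption the first three rows match the table to one decimal place, while the last two rows appear to have been rounded to whole percentages in the table.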

B. Indexing Strategy

We determine our indexing strategy based on how the videos are distributed over the Internet.

1) Indexing strategy for video on video sharing websites: Video sharing websites host the largest share of online video, accounting for 52.8% of all videos. Moreover, they host the videos locally. Therefore, we must index these videos with high quality.

2) Indexing strategy for video on portal websites: We treat these videos the same way as videos on video sharing websites.

3) Indexing strategy for video on BBS/Blog websites: The percentage of video on BBS/Blog websites is considerable. However, most of these videos are embedded copies of videos hosted on other websites, because BBS/Blog authors usually copy a video link from another website and paste it into their post. Therefore, we do not index these videos. However, the text surrounding an embedded video provides useful information about it, so we can use these links to improve the indexing quality of the target video.

4) Indexing strategy for video on aggregation websites: The percentage of video on aggregation websites is also considerable. However, these videos are links to videos on other websites, and the aggregation websites usually provide no additional information describing them. Therefore, we simply do not index these videos.

5) Indexing strategy for video on traditional websites: Although the number of these videos is limited, their growth rate is considerable, i.e., 11k/day. Moreover, the video files are directly accessible, which means we can access their content for further in-depth analysis. Therefore, we should index these videos. However, because these videos are expected to be downloaded before watching rather than viewed online, we need to mark this difference in the search results.

In summary, we found it necessary to index the videos on video sharing websites, portal websites, and traditional websites, which amounts to 65+18+1=84m videos. This conclusion confirmed the need for a dedicated video search service.
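The five per-type decisions above can be summarized as a small lookup table. This is only an illustrative sketch: the `Strategy` class, its field names, and the dictionary keys are hypothetical and do not come from the deployed system; the counts are those of Table I.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    index_video: bool           # is the video itself added to the index?
    use_surrounding_text: bool  # is the page text used to enrich the target video's entry?
    mark_as_download: bool      # is the result flagged as download-before-viewing?


# One entry per website type, following items 1) to 5) above.
STRATEGY = {
    "video_sharing": Strategy(True, False, False),
    "portal": Strategy(True, False, False),
    "bbs_blog": Strategy(False, True, False),
    "aggregation": Strategy(False, False, False),
    "traditional": Strategy(True, False, True),
}

# Video counts in millions, from Table I.
COUNTS_M = {
    "video_sharing": 65,
    "portal": 18,
    "bbs_blog": 18,
    "aggregation": 21,
    "traditional": 1,
}

# Total number of videos (in millions) that the strategy says to index.
indexed_m = sum(n for site, n in COUNTS_M.items() if STRATEGY[site].index_video)
print(indexed_m)  # 84
```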

IV. INDEXING ALGORITHMS

In this section, we introduce our indexing algorithm for videos. The indexing algorithm determines which information is indexed, and therefore what users can search on to find the desired videos.

Our indexing algorithm covers the following content:

1) Title of video / web page

2) Tag, category of video

3) Other surrounding text, e.g., description, users' comments, etc.

Their weights for indexing are ordered 1>2>3.

The algorithm is simple and intuitive. However, is it efficient and appropriate? We address this question by studying user search behavior in the following two aspects.
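To make the weight ordering concrete, the following sketch scores a video by summing field weights for every field that contains a query term. The numeric weights (3, 2, 1), the field names, and the substring matching are assumptions chosen for illustration; the paper only specifies the ordering title > tag/category > other text.

```python
# Hypothetical weights: only the ordering title > tags > text comes from the paper.
FIELD_WEIGHTS = {"title": 3.0, "tags": 2.0, "text": 1.0}


def score(query_terms, doc):
    """Sum the weight of every field in which each query term occurs."""
    total = 0.0
    for term in query_terms:
        for field, weight in FIELD_WEIGHTS.items():
            # Plain substring matching is a simplification of real term matching.
            if term in doc.get(field, ""):
                total += weight
    return total


# Two made-up documents: one matches in the title, one only in the description.
docs = [
    {"id": "a", "title": "obama speech", "tags": "news", "text": "full video"},
    {"id": "b", "title": "evening news", "tags": "politics", "text": "a clip about obama"},
]

ranked = sorted(docs, key=lambda d: score(["obama"], d), reverse=True)
print([d["id"] for d in ranked])
```

With these assumed weights, a title match always outranks a match that occurs only in the surrounding text, which is the behavior the ordering is meant to produce.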

A. Length of Users' Search Queries

We find that users usually type in a few Chinese words. Fig. 1, the distribution of the number of characters in users' search queries, shows that most search queries are composed of 2 to 11 Chinese characters. Because a Chinese word usually consists of 2 or more characters, the number of words in a search query is usually 1-5. The mean number of characters is 6.3337, i.e., 2-3 words. Therefore, we conclude that users are accustomed to searching with a few words.
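A query-length distribution of this kind can be computed with a simple character count per query. The sketch below uses a handful of made-up queries rather than the real query log, and the function name is hypothetical.

```python
from collections import Counter


def length_stats(queries):
    """Return the histogram of query lengths (in characters) and their mean."""
    lengths = [len(q) for q in queries]
    histogram = Counter(lengths)
    mean = sum(lengths) / len(lengths)
    return histogram, mean


# Hypothetical sample queries, not taken from the paper's log.
sample = ["奥巴马", "美女", "赤壁电影", "新闻联播"]
hist, mean = length_stats(sample)
print(sorted(hist.items()), mean)
```

In Python 3, `len` on a string counts Unicode code points, so each Chinese character contributes one to the length.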

Further observation of the search queries suggests the following reason for their short length. Users' search behavior can be categorized into two types: 1) purposely looking for a specific video; 2) random surfing. Correspondingly, the query is either 1) the title of the video or 2) a topic of interest, e.g., “Obama”, “Beauty”, etc. Neither needs a long description. Therefore, users' search queries are short.

The fact that search queries are usually short means they are succinct. Accordingly, the text information we use to index a video should also be succinct. Therefore, we include the following information in the index and give it high weight: the title of the video and of the web page, and the tags and category of the video. This information can be mined from the web pages, is highly relevant to the video, and, most importantly, is succinct.

Figure 1. Distribution of length of users query string

Including the title of the video in the index directly serves the first type of search behavior described above, because the title is the most concise information that identifies a video. Therefore, we give it the highest weight. As a complement, we also consider the title of the web page when necessary.

Including the tags and category of the video in the index directly serves the second type of search behavior described above, because tags and categories are general.

Finally, as a complement, we include other surrounding text, e.g., the description, users' comments, etc., in the index.

B. Analysis of User Clicks

By analyzing users' click behavior in our video search service, we can evaluate how well users are served. We analyzed a segment of the user click log with 53.7m records, each representing a click on a search result provided by our service. Fig. 2 plots the rank-order distribution of the number of clicks per video; Fig. 3 plots the rank-order distribution of the number of clicks per website.

Fig. 2 and Fig. 3 show that both distributions are near power-law. In other words, both exhibit the “Long Tail” characteristic. This observation is consistent with Anderson's assertion [13] that there exist huge opportunities in the unlimited number of non-popular items and websites. It shows that the long tail effect also holds for access to online video, and in particular when users access video through our video search service. The power-law curve of clicks means that users can reach both popular videos, such as hot TV shows and movies, and rare videos, such as educational programs and opera performances.

By studying users' query and click behavior, we are assured that our indexing algorithm serves users well.
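One common way to check whether a rank-order distribution is near power-law is to fit a straight line to log(count) against log(rank). The sketch below does this with ordinary least squares on synthetic data; it is not the analysis procedure used for the paper's click log, and the function name is hypothetical.

```python
import math


def fit_power_law(counts):
    """Least-squares fit of log(count) = intercept + slope * log(rank).

    All counts must be positive. A slope close to a negative constant with
    small residuals suggests a near power-law rank-order distribution.
    """
    ordered = sorted(counts, reverse=True)
    xs = [math.log(rank) for rank in range(1, len(ordered) + 1)]
    ys = [math.log(c) for c in ordered]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic click counts that follow count = 1000 / rank exactly.
clicks = [1000.0 / r for r in range(1, 51)]
slope, intercept = fit_power_law(clicks)
print(round(slope, 3), round(math.exp(intercept), 1))
```

For this synthetic input the fitted slope is -1 and the fitted scale is 1000, up to floating-point error. Real click logs will only approximate a straight line on log-log axes.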

Figure 2. Distribution of click of video

Figure 3. Distribution of click of website

V. THE STATUS OF ONLINE VIDEO IN CHINA IN 2009

Two years have passed since we developed and deployed our video search service in 2007. During this time, we have continuously improved the coverage of our search engine and refined our indexing algorithms to better match users' search requirements. However, the basic indexing strategy and indexing algorithms have remained stable. In this section, we present the current status of online video in China and compare it with that of two years ago.

A. Distribution of Video on Website

We first present the distribution of the number of videos hosted per website, obtained from our video database. Fig. 4 plots the cumulative distribution of the number of videos per website. It shows that a few websites host a large share of the videos. For example, the first 4, 8, and 10 websites account for 66.08%, 87.93%, and 93.18% of all videos, respectively. This observation supports our strategy of placing great emphasis on several key websites, i.e., crawling, indexing, and validating their videos more frequently.
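The cumulative share of the top-k websites can be computed by sorting the per-site video counts and accumulating them. The counts below are made up for illustration and are not the measured values behind Fig. 4; the function name is hypothetical.

```python
from itertools import accumulate


def top_k_share(videos_per_site, k):
    """Percentage of all videos hosted by the k largest websites (k >= 1)."""
    ordered = sorted(videos_per_site, reverse=True)
    total = sum(ordered)
    cumulative = list(accumulate(ordered))
    # Clamp k so that asking for more sites than exist returns 100%.
    return 100.0 * cumulative[min(k, len(ordered)) - 1] / total


# Hypothetical per-site video counts.
sites = [400, 300, 150, 50, 40, 30, 20, 10]

for k in (1, 2, 4, 8):
    print(k, top_k_share(sites, k))
```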

On the other hand, we find that the remaining roughly 10% of videos are widely distributed, and their distribution across websites is near power-law. Fig. 5 plots the probability distribution of the number of videos per website. It shows that, except for the first 9 websites, the distribution of the number of videos on the other websites is near power-law. The power-law distribution has recently been used by Anderson [13] to explain the so-called “Long Tail”. Anderson asserted that there exist huge opportunities in the unlimited number of non-popular items. Clearly, we should include these videos in our search database too.

These two findings support the need for our video search service. First, there is no single dominant website in the area of online video, so content is inevitably dispersed across multiple websites. Our video search service can gather all these dispersed videos in one database and provide a unified interface for users to explore. Second, the videos on the long-tail websites are even more dispersed, and they should not be ignored because their total number is considerable. Only a search engine can find these videos and make them accessible to ordinary users.

Figure 4. Cumulative distribution of number of video of website

Figure 5. Probability distribution of number of video of website

B. The Growth of Online Video in China

As described before, the total number of online videos in China in 2007 was estimated to be 84m. With the newest database, we estimate the total in 2009 to be around 180m, meaning that online video in China grew to 180/84=2.14 times its 2007 size in two years. This is a tremendous increase when compared with the growth in the number of online video users, which was just 25% according to [14].

We now evaluate the growth of video on video sharing websites and portal websites. In 2009, video sharing websites and portal websites account for 72.49% and 27.51% of the videos, respectively, meaning that the numbers of these two kinds of video are 130.48m and 49.52m. Compared with 65m and 18m in 2007, they grew to 2 times and 2.75 times their 2007 size, respectively.

It is interesting that video on portal websites grew more quickly than video on video sharing websites. This observation shows that the portal websites have caught the trend and are providing more and more online video on their sites.
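The growth figures in this subsection follow from the 2007 counts in Table I and the 2009 totals and percentages quoted above. A short sketch of the arithmetic, with illustrative variable names:

```python
# Totals in millions of videos, as stated in the text.
total_2007_m = 84.0
total_2009_m = 180.0

# 2007 counts from Table I and 2009 shares from this subsection.
sharing_2007_m = 65.0
portal_2007_m = 18.0
sharing_2009_m = total_2009_m * 72.49 / 100.0
portal_2009_m = total_2009_m * 27.51 / 100.0

# Ratio of the 2009 value to the 2007 value.
growth_total = total_2009_m / total_2007_m
growth_sharing = sharing_2009_m / sharing_2007_m
growth_portal = portal_2009_m / portal_2007_m

print(round(sharing_2009_m, 2), round(portal_2009_m, 2))
print(round(growth_total, 2), round(growth_sharing, 2), round(growth_portal, 2))
```

The ratios come out at roughly 2.14 overall, 2.0 for video sharing websites, and 2.75 for portal websites, in line with the figures reported in the text.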

VI. CONCLUSION

We presented the design decisions and practice behind our video search engine in China. We reported our measurements of the status of online video in China and of users' click behavior, which demonstrate the need for a video search engine. Future work includes investigating users' search behavior and how well users are satisfied, and improving the indexing algorithms accordingly.

REFERENCES

[1] T. Leighton, "Improving performance on the internet," ACM Queue, October 2008.

[2] http://www.baidu.com

[3] "iResearch: Chinese Search Engine Request Report in Q1, 2009" [Online document]: http://iresearchgroup.com.cn/html/Consulting/search_engine/DetailNews_id_94902.html

[4] [Online document]: http://www.alexa.com/siteinfo/ku6.com#clickstream, 2009-5-17

[5] http://video.baidu.com

[6] J.R. Smith, S.F. Chang, "An image and video search engine for the world-wide web," Proc. SPIE Storage and Retrieval for Image and Video Databases V, 1997.

[7] M. Cha, H. Kwak, P. Rodriguez, Y.Y. Ahn, S. Moon, "I tube, you tube, everybody tubes: analyzing the world's largest user generated content video system," Proc. IMC'07, 2007.

[8] http://www.youtube.com

[9] M. Saxena, U. Sharan, S. Fahmy, "Analyzing video services in Web 2.0: a global perspective," Proc. the 18th International Workshop on Network and Operating Systems Support for Digital Audio and Video, 2008.

[10] X. Wu, A.G. Hauptmann, C.W. Ngo, "Practical elimination of near-duplicates from web video search," Proc. the 15th international conference on Multimedia, 2007.

[11] http://www.tudou.com

[12] http://video.sina.com.cn/news

[13] C. Anderson, The Long Tail: Why the Future of Business Is Selling Less of More. Hyperion, 2006.

[14] The 23rd Statistical Survey Report on the Internet Development in China, http://www.cnnic.cn/uploadfiles/pdf/2009/3/23/153540.pdf, China Internet Network Information Center (CNNIC).