Can Skype be used beyond video calling?

Georgios Exarchakos Technische Universiteit Eindhoven

The Netherlands +31402475653

[email protected]

Vlado Menkovski Technische Universiteit Eindhoven

The Netherlands +31402475653

[email protected]

Antonio Liotta Technische Universiteit Eindhoven

The Netherlands +31402473890

[email protected]

ABSTRACT

Skype nodes generate a substantial part of real-time bi-directional video traffic nowadays. Employing a range of adaptive mechanisms, the application configures video streaming to meet the requirements of the communication and constraints of the underlying network. While other related works focus on passive network monitoring of Skype data flows, this paper studies Skype as a point-to-point streaming engine of video, beyond standard video calling. The emphasis is on the objective video quality as perceived by viewers. We built a testbed to generate network perturbations and stream certain videos between Skype nodes. We examine how network impairments affect objective metrics (i.e. PSNR and SSIM index) of the video and Skype’s ability to reconstruct the original sample. The results suggest that Skype is weak at delivering good quality for high motion videos and that it is slow at recovering after a long period of high packet loss.

Categories and Subject Descriptors

D.2.4 [Distributed Systems]: Distributed Applications. C.4 [Performance of Systems]: Fault tolerance, H.4.3 [Communications Applications]: Computer conferencing, teleconferencing and videoconferencing.

General Terms

Measurement, Performance, Experimentation, Human Factors

Keywords

Skype, Video Quality, Real-time streaming, Peak signal-to-noise ratio, Structural similarity index

1. INTRODUCTION

As real-time multimedia delivery, e.g. video calling, evolves into an important network application, the requirements for high-quality video increase. Real-time streams are network- and time-constrained flows of data; packets need to survive the impairments introduced by network channels and reach the destination within certain time frames.

Skype is a well-known video calling application that has developed customized video flow control mechanisms. Video packets that are lost, delayed or extensively impaired in network channels might be useless at their destination. Skype assumes certain characteristics (e.g. head and shoulders video call sessions) of the delivered video to optimize its perceived quality. However, with the recent release of SkypeKit (footnote 1), more and more devices will soon appear to stream Skype videos. The traditional video calling setting will significantly change under the pressure of multiple Skype video-enabled ubiquitous devices and a wide range of video content dynamics. With this study, we try to understand whether Skype could serve as a more general-purpose point-to-point video streaming engine.

Video calls generate inelastic traffic, i.e. very little buffering is allowed, and their content is usually produced and consumed at the edges of the network. We assume that Skype targets calls between people at a relatively static position, but its adoption by numerous devices pushes for more varied content. For instance, lecture streaming on e-learning platforms does not exhibit the variety of outdoor video recording, but its dynamics are still higher than those of a ‘head&shoulder’ video call. This means that Skype would have to optimize its video protocol to allow for more video types.

The motivation of our work is the design of a mechanism that estimates the perceived cost of network conditions on Skype video delivery. To this end, we try to assess in an objective way the impact of highly lossy channels on the perceived quality of a Skype video call. Skype is an application sitting at the edges of a network, enabling video streaming under varying conditions of both video source and sink. Traditional video streaming schemes lack the flexibility that Skype tries to achieve.

Our contribution lies in the testbed we developed and the objective video quality analysis we carried out on input videos. We streamed raw video files with Skype via a lossy channel. Transmitted videos were recorded at the receiver side and aligned with the original input samples. Finally, we carried out an analysis of the videos with objective quality of experience metrics such as PSNR [1] and SSIM [2]. The results show the weaknesses of the Skype protocol with high-motion videos and fast changes to the network channel conditions.

This paper is organized in six sections. After the introduction and related work, section 3 describes the setup of the testbed we built to run the objective video quality measurements on Skype video calls. We carry on with the design of the actual experiments in section 4. Section 5 presents the results and a discussion of them. An overview of the conclusions and future work plans is included in section 6.

2. RELATED WORK

Despite the time constraints of video calls, Skype operates on best-effort network pipes that cannot provide guarantees about the delivered video quality. However, it is a prominent streaming application as it manages to deliver acceptable quality. As its protocol is proprietary, a number of studies have tried to characterize its traffic in different environments and conditions.

1 https://developer.skype.com/public/skypekit

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MoMM’11, December 5–7, 2011, Ho Chi Minh City, Vietnam. Copyright 2011 ACM 978-1-4503-0785-7/11/0012…$10.00.

The vast majority of these studies analyze the voice streaming protocol. Bonfiglio et al. [3] in 2007 designed a set of classifiers to automatically detect Skype voice traffic from an aggregate of captured data packets. The classification capabilities of the method presented in [3] are further extended in [4], [5]. Dario et al. in [6] follow a similar analysis technique and give a fine-grained characterization of both Skype packets and flows. Te-Yuan Huang et al. in [7] investigate the reasons for Skype's success. They conclude that much of its acceptance and success in delivering satisfactory voice streams lies in its forward error correction (FEC) mechanism. Skype embeds and multiplexes extra data in voice streams, which allows the destination to recover the stream from network impairments. The authors of [7] identify the details of the Skype FEC mechanism and compare it with their own optimal FEC algorithms for certain non-proprietary codecs. De Cicco and Mascolo in [8] take a step beyond Skype Voice-over-IP traffic characterization and provide a mathematical model for the congestion control that Skype uses for voice calls. They concluded that Skype sends at a rate that is highly correlated to the packet loss of the voice stream.

After Skype introduced video calling, similar traffic analysis studies appeared. De Cicco et al. in [9] describe a traffic analysis methodology for Skype videos in channels with bandwidth variations. They built a testbed to modify the available bandwidth and study the reaction of video streams co-existing with other TCP traffic. The authors of [10] carry out experiments on Skype video protocols and identify signaling traffic generated by nodes to adapt video streams to network conditions. They also identify characteristics of users by just tracing Skype traffic.

Besides traffic characterization of Skype voice and/or video streams, some recent research addresses the quality assessment of those streams. In 2006, Kuan-Ta Chen et al. [11] proposed a user satisfaction index based on the duration of a Skype voice call. The index was based solely on objective metrics that can be retrieved in real time. However, the user perception of quality is not directly measured through objective Quality of Experience metrics such as MOS. QoE is rather inferred from the call duration, which the authors argue provides a strong indication of user satisfaction.

Hossfeld and Binzenhoefer in 2008 [12] presented a QoS and QoE analysis of Skype voice calls in UMTS networks. Instead of tracing Skype packets, the authors streamed a specific voice recording with Skype and studied the behavior of the protocol under packet loss and delay impairments. An important part of their work was the quality assessment of the received signals based on Mean Opinion Score [13]. Cano and Cerdan [14] in 2010 published a comparison of four VoIP applications, including Skype, with regard to delivered quality, carrying out subjective studies. More specifically for video calling, they assess the quality with Absolute Category Rating (ACR) and Degradation Category Rating (DCR) methods [15]. They concluded that Skype, like other competing applications, reaches a high QoS but not QoE; there is a mismatch between the two dimensions.

Finally, Jing Zhu in 2011 [16] published the results of his experiments on Skype video call quality. The aim of the paper is to study the stream behavior over a WiMAX network. The author pushed encoded video files via a virtual camera which was then used as input to Skype call. The stream went through a typical WiMAX network configuration and the video was recorded at the receiver side. To align reference and recorded video, the author describes a cross correlation method. Once videos are aligned, the quality is assessed based on effective frame rate and mean opinion score.

While many studies tried to provide experimental or formal characterization of Skype streams, only [16] moved to the objective assessment of delivered video. While [16] tries to assess the Skype video quality over WiMAX network pipes, in this work we try to understand how exactly changes on network pipes may affect the delivered quality as perceived by users. We devise a more accurate (non-probabilistic) method for aligning sample and recorded videos and assess the quality based on more accurate objective metrics. The second dimension we study is the use of Skype as a streaming tool with videos beyond the simple head&shoulder types. We try to understand to what extent Skype may be employed beyond simple video calling.

3. EXPERIMENTAL SETUP

The objective of this work is to compare the delivered quality of a video through Skype for a range of network conditions imposed by us. We have developed a testbed for streaming and recording a video over Skype and then analyzing its quality compared to the initial one. We carry out our tests in five steps:

1. Input raw video in a Skype stream instead of the default camera output;

2. Impair the video stream before it is delivered to the other side;

3. Record delivered video at the other side;

4. Synchronize the sequences of original and recorded frames;

5. Compare input and output videos frame-by-frame and use objective metrics for QoE assessment to compute the loss of quality.

An important reason for using a custom testbed that tricks Skype was the lack of documentation on Skype’s proprietary streaming protocol. Hence, we cannot assume the Real-time Transport Protocol (RTP) or any other well-known protocol. As shown in Figure 1, the overall architecture consists of three hosts: node A hosts the Skype node and video source; B is the Skype video sink; and C impairs all the traffic between source and sink. A and B belong to different domains and are directly connected to C via crossover cables. C can forward traffic generated by A and B to the Internet but blocks any connection attempted from the Internet to them.

Figure 1. Experimental setup for video quality assessment

This configuration allows Skype nodes to authenticate and bootstrap through the Skype servers but does not allow traffic between each other via the Internet. In brief, A and B are forced to communicate only via C. A and B run on Windows systems to allow for the latest Skype protocol. The impairment box uses netem on CentOS to change the network interface configuration. As connections to B are only allowed by A, C impairs the output of its interface connected to B ensuring that the impairments affect only the video stream.
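As an illustration of how an impairment box could drive netem, the following minimal sketch builds a `tc` command line that applies random packet loss on one interface. The interface name `eth1` and the helper name `netem_loss_command` are assumptions made for this example; the paper does not list the exact commands used on node C.

```python
def netem_loss_command(interface: str, loss_percent: float) -> list[str]:
    """Build a `tc` invocation that applies random packet loss with netem."""
    if not 0.0 <= loss_percent <= 100.0:
        raise ValueError("loss_percent must be within [0, 100]")
    # `replace` creates the root qdisc if it is absent and updates it otherwise.
    return [
        "tc", "qdisc", "replace", "dev", interface, "root",
        "netem", "loss", f"{loss_percent:g}%",
    ]


# "eth1" is a hypothetical name for the interface of C that faces sink B.
print(" ".join(netem_loss_command("eth1", 5)))
```

The returned list can be passed to `subprocess.run` on a Linux host with root privileges; changing the loss level during a call then amounts to issuing the command again with a new percentage.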

We developed a Skype plugin to automatically select the video from the file system, read user input for frame size and rate, and finally start a Skype video call to a specific user at the sink B. At the same time, the plugin notifies C to impair the current stream. The original Skype application can only accept input from a camera. As Skype has not released the video API of the application, we used e2eSoft VCam (footnote 2) to fake the camera input. Other options for virtual camera drivers were considered, like gstreamer (footnote 3); however, VCam was the only one to offer an open SDK for Windows to connect it with the Skype4COM API.

To avoid loss of information, the input video was in raw RGB24 format, read from a binary file and pushed to the VCam driver frame by frame at a certain rate. By using a raw video file, we eliminate any loss of information before the video frames are handed over to Skype for streaming. Moreover, it enables direct pixel-by-pixel comparisons between original and recorded videos. Encoding the video before pushing it into the VCam driver would make it impossible for Skype to apply its own encoding schemes.
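Reading a raw RGB24 file frame by frame only needs the frame dimensions, because each pixel occupies exactly three bytes (a 640x320 frame is 614,400 bytes). The sketch below shows the idea on an in-memory stream; the function name `read_rgb24_frames` is hypothetical and the actual plugin, which talks to the VCam SDK, is not reproduced here.

```python
import io
from typing import BinaryIO, Iterator


def read_rgb24_frames(stream: BinaryIO, width: int, height: int) -> Iterator[bytes]:
    """Yield consecutive raw RGB24 frames from a binary stream."""
    frame_size = width * height * 3  # three bytes (R, G, B) per pixel
    while True:
        frame = stream.read(frame_size)
        if len(frame) < frame_size:
            return  # end of file, or a truncated trailing frame
        yield frame


# Tiny in-memory demonstration: two black frames of 4x2 pixels.
demo = io.BytesIO(bytes(2 * 4 * 2 * 3))
frames = list(read_rgb24_frames(demo, width=4, height=2))
print(len(frames), len(frames[0]))
```

A sender would iterate over the frames of the real file and hand each one to the virtual camera at the chosen rate, for example by sleeping for the frame interval between pushes.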

Skype video recording was an issue and many available tools were tested. However, none of these tools was found to record robustly at rates higher than 25 frames per second, at any resolution and in raw RGB24 format. Thus, we used Camtasia Studio to record a screen big enough to handle the resolution of the video and store it in raw format. This way of recording introduced a number of problems such as:

• Oversampling: most of the time Skype was unable to deliver 30 frames per second, but Camtasia was constantly recording at 30fps.

• Regardless of the network impairments, frames were delivered with a varying time lag. Hence, the recorded video was stretched in a non-uniform way.

• Many frames were lost especially at the beginning of the stream or when high packet loss was imposed in the stream.

Aligning the recorded frames with the transmitted frames had to take into account all these distortions on the stream. We introduced frame indexing within the video stream. The idea behind this indexing technique is a black and white progress bar embedded in each frame which grows for every transmitted frame. The sink can calculate which frame number was actually received or dropped. The width of each step of that progress bar (i.e. number of white pixels added to the previous step) has to be big enough to survive even the most severe impairments. However, the length of the bar might get much bigger than the width of the frame depending on the number of frames transmitted.

2 http://www.e2esoft.cn/vcam/

3 http://gstreamer.freedesktop.org/

Let us assume a video frame w pixels wide (e.g. w=640) and a step of the bar b pixels wide (e.g. b=5). That is, a w-wide bar can accommodate $r = \lfloor w/b \rfloor$ (e.g. $r = \lfloor 640/5 \rfloor = 128$) frames. Every r frames a new cycle has to start. Therefore, we embedded a cyclic progress bar, which fills up from left to right with white pixels during odd rounds and with black pixels during the even rounds. Due to distortions explained above, many frames were repeated in the recorded video. The indexing technique with the progress bar helped us identify the distinct frames as well as the number of their repetitions. This index was enough to reconstruct the video with exactly the same number of frames as the original. In case of a gap, i.e. lost frames, the last frame before the gap was used.
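The cyclic progress bar can be sketched as follows for a single row of bar pixels. This is a simplified illustration under stated assumptions: frame numbers are 0-based, pixels are either 0 or 255, the width is a multiple of the step, and the function names `encode_bar` and `decode_bar` are hypothetical rather than taken from the authors' plugin.

```python
def encode_bar(frame_number: int, width: int = 640, step: int = 5) -> list[int]:
    """Return one row of bar pixels (255 = white, 0 = black) for a 0-based frame."""
    steps_per_cycle = width // step                 # r = floor(w / b)
    cycle = frame_number // steps_per_cycle + 1     # 1-based cycle number K
    filled = (frame_number % steps_per_cycle + 1) * step
    # Odd cycles grow a white bar over black, even cycles a black bar over white.
    fill, background = (255, 0) if cycle % 2 == 1 else (0, 255)
    return [fill] * filled + [background] * (width - filled)


def decode_bar(row: list[int], step: int = 5) -> tuple[bool, int]:
    """Return (cycle_is_odd, 1-based position in the cycle) for a clean bar row."""
    fill = row[0]
    filled = 0
    for pixel in row:
        if pixel != fill:
            break
        filled += 1
    return fill == 255, filled // step


print(decode_bar(encode_bar(0)), decode_bar(encode_bar(128)))
```

The decoder only recovers the parity of the cycle, so turning positions into absolute frame numbers additionally requires counting parity flips along the recording, which assumes that no gap is longer than one full cycle.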

The procedure above resulted in RGB24 videos recorded from Skype containing the impairments introduced by Skype codec and impairment box. The size of the videos was the same as their originals; hence, peak signal-to-noise ratio (PSNR) and structural similarity analysis could be applied.
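The reconstruction step can be sketched as below: repeated captures of the same index are collapsed, and every missing index is filled with the last frame seen before the gap. The function name `reconstruct` is hypothetical, and leaving leading gaps as `None` is an assumption of this sketch, since the text does not say how frames lost before the first received frame were handled.

```python
from typing import Optional, Sequence, TypeVar

F = TypeVar("F")


def reconstruct(recorded: Sequence[tuple[int, F]], total_frames: int) -> list[Optional[F]]:
    """Rebuild a video with one frame per original index from indexed captures."""
    first_seen: dict[int, F] = {}
    for index, frame in recorded:
        first_seen.setdefault(index, frame)  # collapse repeated captures

    output: list[Optional[F]] = []
    last: Optional[F] = None  # stays None until the first frame arrives
    for index in range(total_frames):
        if index in first_seen:
            last = first_seen[index]
        output.append(last)  # a gap repeats the last frame before it
    return output


print(reconstruct([(1, "a"), (1, "a"), (3, "b")], total_frames=5))
```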

4. DESIGN OF EXPERIMENTS

For these experiments, we used a video file ‘Shields’ picked from the Live Video Quality Database (footnote 4). The original file was a YUV420 video of 250 frames at 1920x1080 resolution. We chose a video with portions of both high and low motion, which allows testing both conditions at the same time. As shown in Figure 2, the chosen video has long periods of high temporal information index and short periods of low motion.

Figure 2. ‘Shields’ sample video with long periods of high temporal information index and smaller periods of low temporal information index

Using ffmpeg (footnote 5) we downscaled and converted the video to RGB24 format at 640x320 resolution. Concatenating the same video four times, we produced a 1000-frame clip. The Skype plugin developed at the source side read the file frame by frame, adding the appropriate step of the progress bar on top of the video frame, thus producing frames of 640x480 pixels, as shown in Figure 3. Then, frames were pushed to the VCam driver at a rate of 25fps; that is, the whole video lasted 40 seconds.

The example of Figure 3 illustrates four frames from the transmitted video. The video alignment engine at the sink extracts the top 160px, increases the contrast and measures the bar length. Assuming the frame is at cycle K, if K is an odd number then the progress bar is white; otherwise, the bar is black. For instance, the white bars of Figure 3 have length i and j respectively and represent the ith and jth frame of Kth cycle. Symmetrically, the black bars represent the mth and nth frames of (K+1)th cycle.
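Measuring the bar length on a recorded frame has to tolerate codec noise. One possible way, shown here as a sketch rather than the authors' actual procedure, is to binarise a grayscale row of the bar (a crude form of contrast maximisation) and take a majority vote inside each step. The threshold of 128, the majority vote and the name `measure_bar` are all assumptions of this example.

```python
def measure_bar(gray_row: list[int], step: int = 5, threshold: int = 128) -> tuple[bool, int]:
    """Return (bar_is_white, number_of_filled_steps) for a noisy grayscale bar row."""
    binary = [pixel >= threshold for pixel in gray_row]  # True means white

    def step_is_white(k: int) -> bool:
        block = binary[k * step:(k + 1) * step]
        return sum(block) * 2 > len(block)  # majority vote within one step

    total_steps = len(gray_row) // step
    bar_is_white = step_is_white(0)
    filled = 0
    for k in range(total_steps):
        if step_is_white(k) != bar_is_white:
            break
        filled += 1
    return bar_is_white, filled


# Two bright steps followed by two dark ones, each with one noisy pixel.
noisy = [250, 240, 90, 235, 245] + [230] * 5 + [20, 30, 140, 10, 25] + [15] * 5
print(measure_bar(noisy))
```

A wider step gives the vote more pixels to work with, which is consistent with the requirement that each step be big enough to survive severe impairments.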

4 http://live.ece.utexas.edu/research/quality/live_video.html

5 http://ffmpeg.arrozcru.org/builds/

[Figure 2 plot: temporal information (0 to 10) versus seconds (0 to 10)]

Figure 3. Frames with progress bars as sent via VCam to Skype. Snapshots i, j belong to cycle K and m, n belong to K+1

Both source and impairment boxes were running on a quad-core Intel Xeon 3430 series processor with 4GB of DDR3 memory. Skype at source and sink was the latest version, 5.3.0.120. The sink had an Intel Core Duo processor at 2.5GHz and 4GB of DDR3 memory.

We defined three experiments, based on three impairment levels (Figure 4). The first experiment imposed no impairments (no packet loss) on the stream, while the second and third dropped packets uniformly (low-loss and high-loss patterns); that is, all packets have the same probability of being dropped. In detail, the second experiment simulated a small stepwise increase of the packet loss rate, whereas the third simulated a steeper one. Packet delay and jitter were not part of the experiments, as we wanted to investigate the ability of the Skype protocol to reconstruct the original video. More specifically, we want to find a measurable cost of packet loss on Skype video quality.
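Uniform dropping means that every packet is discarded independently with the same probability. The following sketch simulates this behaviour in software; it only illustrates the loss model and is not the netem implementation used on the impairment box. The function name `drop_uniformly` and the seed are assumptions.

```python
import random


def drop_uniformly(packets: list[int], loss_rate: float, rng: random.Random) -> list[int]:
    """Keep each packet independently with probability 1 - loss_rate."""
    return [packet for packet in packets if rng.random() >= loss_rate]


rng = random.Random(2011)  # fixed seed so the demonstration is repeatable
delivered = drop_uniformly(list(range(100_000)), loss_rate=0.10, rng=rng)
print(f"delivered fraction: {len(delivered) / 100_000:.3f}")
```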

Figure 4. Packet loss rate over the 40 seconds of the video for the two low- and high-loss experiments

At the sink, Camtasia Studio 7.1 was recording 30 frames per second without compression. The received resolution of the video was smaller than the screen resolution and Camtasia was configured to record the video area only i.e. 640x480 pixels. The recorded video was cropped to start exactly at the first received frame and finish at the last displayed. The resulting file was split into two videos: one containing the progress bar (640x160 pixels) and the second containing the real content video (640x320 pixels). The index was calculated processing the progress bar video, which was used to filter the recorded content and to produce the final content video. To reduce the probability of errors while calculating the index of each frame, we maximized the contrast of progress bar frames. Figure 5 gives a schematic description of this procedure.

Figure 5. Skype video quality assessment process

Finally, we used the MSU VQMT tool (footnote 6) to compute the PSNR and SSIM values of the final recorded content video, compared to the original one.
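For reference, PSNR between two equally sized 8-bit frames follows from the mean squared error as 10·log10(255²/MSE). The sketch below computes it over all RGB bytes of a frame; this is an assumption of the example, and the MSU tool may use a different colour component or averaging scheme.

```python
import math


def psnr(original: bytes, recorded: bytes, peak: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized 8-bit frames."""
    if len(original) != len(recorded) or not original:
        raise ValueError("frames must be non-empty and of equal size")
    mse = sum((a - b) ** 2 for a, b in zip(original, recorded)) / len(original)
    if mse == 0:
        return math.inf  # identical frames
    return 10.0 * math.log10(peak ** 2 / mse)


reference = bytes([100] * 12)
degraded = bytes([110] * 12)  # every sample is off by 10, so MSE = 100
print(round(psnr(reference, degraded), 2))
```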

5. RESULTS AND DISCUSSION

Aligning the recorded and original videos revealed that many frames were never displayed at the sink, even in the experiment without impairments. This reflects the inability of the Skype player to reconstruct the dropped frames, either because its streaming counterpart never sent them or because not enough information was available on time to build the frames. As Figure 6 illustrates, more than 400 out of 1000 frames in the no-loss and low-loss experiments were never displayed. The number of frames displayed exactly once is almost the same for the two experiments, with a slight advantage for the no-loss one. The frames displayed two or more times practically covered the gaps left by dropped frames.

The vast majority of frames in the high-loss experiment are never displayed. Very few of the frames are displayed exactly once, and a small portion of frames is displayed more than 15 times (e.g. one frame appears 77 consecutive times).

Figure 6. Histogram showing the number of frames that were dropped or displayed once, twice or more times, respectively
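The categories of such a histogram can be derived from the frame indices decoded at the sink. The sketch below counts, for every original frame, how many recorded frames carry its index and sorts it into the bins named in the figure; the function name `display_histogram` and the toy input are hypothetical.

```python
from collections import Counter


def display_histogram(recorded_indices: list[int], total_frames: int) -> dict[str, int]:
    """Count original frames per display category (dropped, once, twice, ...)."""
    shown = Counter(recorded_indices)
    bins = {"dropped": 0, "once": 0, "twice": 0, "3 times": 0, "[4,15]": 0, ">15": 0}
    for index in range(total_frames):
        count = shown.get(index, 0)
        if count == 0:
            bins["dropped"] += 1
        elif count == 1:
            bins["once"] += 1
        elif count == 2:
            bins["twice"] += 1
        elif count == 3:
            bins["3 times"] += 1
        elif count <= 15:
            bins["[4,15]"] += 1
        else:
            bins[">15"] += 1
    return bins


# Toy recording of an 8-frame clip: frames 2, 6 and 7 never appear.
toy = [0, 1, 1, 3, 3, 3] + [4] * 4 + [5] * 16
print(display_histogram(toy, total_frames=8))
```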

Figure 7, Figure 8 and Figure 9 illustrate (a) the PSNR values and (b) the frame drop rate for the no-loss, low-loss and high loss experiments respectively. The PSNR values in Figure 7(a), Figure 8(a) and Figure 9(a) reflect the frame-per-frame comparison of the original and recorded video. To make the figures more readable, we displayed the running average of eleven PSNR values; that is the average of the PSNR values for the last five, the current and the next five frames.

6 http://compression.ru/video/quality_measure/video_measurement_tool_en.html

[Figure 3 labels: snapshots i, j, m and n; frame width 640px]

[Figure 4 plot: packet loss rate (0% to 20%) versus seconds (0 to 40); series: Low loss, High Loss]

[Figure 5 blocks: Video source (Read YUV frame from file, YUV420 to RGB24, Embed progress bar, Deliver to VCam, Stream with Skype); Impairment box (Packet loss); Video sink (Receive with Skype, Capture with Camtasia Studio, Extract progress bar, Calculate frame index, Reconstruct video, PSNR and SSIM)]

[Figure 6 plot: number of frames (0 to 900) per category (dropped, displayed once, displayed twice, displayed 3 times, displayed [4,15], displayed >15); series: No loss, Low loss, High loss; inset with counts 0 to 3 for display counts 16, 17, 19, 24, 28, 30, 31, 49, 62, 63, 73, 77]
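The eleven-value running average used for the PSNR curves is a centred window covering the previous five, the current and the next five frames. A minimal sketch follows; shrinking the window at the start and end of the sequence is an assumption, as the text does not describe how the edges were treated.

```python
def running_average(values: list[float], half_window: int = 5) -> list[float]:
    """Centred moving average over 2 * half_window + 1 values."""
    smoothed = []
    for i in range(len(values)):
        low = max(0, i - half_window)
        high = min(len(values), i + half_window + 1)
        window = values[low:high]  # shorter near the edges (assumed behaviour)
        smoothed.append(sum(window) / len(window))
    return smoothed


print(running_average([0.0, 3.0, 6.0], half_window=1))
```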

Figure 7. No-loss experiment: (a) PSNR analysis per frame and (b) frame drop rate per second

A closer look at the frame loss rate per second shows how the characteristics of the streamed video are reflected in the recorded video. Figure 7(b), Figure 8(b) and Figure 9(b) illustrate the percentage of dropped frames per second for four cycles. Each cycle is one repetition of the original video (250 frames). When the video shifts from one cycle to the next, or when there is an increase in packet loss, there is also an increase in frame drop rate. This becomes more noticeable in the high-loss experiment. The temporal difference between the last frames of a cycle and the first ones of the next cycle is high. This forces Skype to send more data. In case of a high packet loss rate, frames cannot be reconstructed and, hence, are dropped.
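The per-second frame drop rate can be computed from the set of original frame indices that were displayed at least once, grouping the original frames into blocks of 25 (one second at 25fps). The function name `drop_rate_per_second` and the toy input are assumptions of this sketch.

```python
def drop_rate_per_second(displayed: set[int], total_frames: int, fps: int = 25) -> list[float]:
    """Percentage of original frames never displayed, for each second of video."""
    rates = []
    for start in range(0, total_frames, fps):
        second = range(start, min(start + fps, total_frames))
        dropped = sum(1 for index in second if index not in displayed)
        rates.append(100.0 * dropped / len(second))
    return rates


# Toy example: only even-numbered frames of a 2-second clip were displayed.
print(drop_rate_per_second(set(range(0, 50, 2)), total_frames=50))
```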

Figure 8. Low-loss experiment: (a) PSNR analysis per frame and (b) frame drop rate per second

Based on these figures, PSNR values go down as frame drop rate increases. For every dropped frame, the last available one is displayed. Hence, the more consecutive frames are dropped the lower the PSNR gets.

Figure 9. High-loss experiment: (a) PSNR analysis per frame and (b) frame drop rate per second

There is a characteristic part of these figures around the point where each cycle finishes and the next one starts: the PSNR values increase, indicating a better quality of the video, but as soon as the cycle restarts there is an abrupt drop. This suggests that Skype may easily handle low-motion videos, but high-motion videos will suffer from quality degradation even at low packet loss rates. Despite the packet loss of the second experiment, its PSNR values seem to be comparable to those of the first experiment (average PSNR: 22.196dB versus 22.377dB, respectively).

Figure 10. Structural Similarity index of recorded video compared to the original one. (a) no-loss, (b) low-loss and (c) high-loss experiments

The situation is significantly different with the third experiment. As soon as the packet loss jumps to 20%, Skype seems to be unable to display most of the frames. However, PSNR values do not go up even when the packet loss drops back to 10% and then 5%; only when packet loss drops to 1% do the PSNR values start increasing again. This suggests that Skype avoids adapting immediately to packet loss changes, which has a detrimental effect on high-motion video.

[Figure 7 plots: (a) PSNR (dB, 0 to 30) over the 1st to 4th cycles, No Loss, annotated value 22.377; (b) frame drop rate (0% to 100%) versus seconds (0 to 40), annotated value 41.3%]

[Figure 8 plots: (a) PSNR (dB, 0 to 30) over the 1st to 4th cycles, annotated value 22.196; (b) loss rate (0% to 100%) versus seconds (0 to 40), series Frame Loss rate and Packet Loss rate, annotated value 41.6%]

[Figure 9 plots: (a) PSNR (dB, 0 to 30), High Loss, annotated value 15.941; (b) loss rate (0% to 100%) versus seconds (0 to 40), series Frame Loss rate and Packet Loss rate, annotated value 83.3%]

[Figure 10 plots: SSIM (0% to 100%) versus seconds (0 to 40) in panels (a), (b) and (c); annotated values 71.264%, 70.633% and 43.218%]

Besides the PSNR analysis, we produced the SSIM values, which give a more accurate account of human perception. Figure 10 shows the similarity index for the three experiments, again as a running average of 11 frames. As indicated by the analysis above, the first two experiments produced quite similar videos. The fast alternations between low and high similarity index occur because of the missing frames.

The third experiment produced a much less similar video. After the twentieth second, the similarity index is mainly under 30% except for certain spikes. These spikes indicate which new frame was displayed; the distance between two consecutive spikes represents the duration a frame was displayed, replacing missing frames. Finally, even if frames are displayed the similarity index does not reach the levels of the other two experiments. The high packet loss forces the Skype rendering algorithm to skip frames and, hence, to ignore impairments within the displayed frames.

The conclusions above are also supported by the average frame drop rate per packet loss level. We averaged the frame loss rate per level of packet loss introduced in the Skype stream. We then compared it against the frame loss rate of the no-loss experiment. These results are illustrated in Figure 11 and Figure 12 for the low-loss and high-loss experiments, respectively. The graphs of average frame drop rate for the no-loss experiment differ between the two figures because of the different averaging intervals. In Figure 11, the average value is calculated over the steps of the low packet loss pattern whereas, in Figure 12, it is calculated over the steps of the high packet loss pattern.
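The per-level averaging can be sketched as below. This is an illustration under stated assumptions rather than the procedure used in the experiments: samples are assumed to be (time in seconds, frame loss rate) pairs, each level is assumed to be a half-open time interval tagged with its packet loss rate, and the name `average_per_level` and all interval boundaries are hypothetical.

```python
def average_per_level(samples, levels):
    """Average a time series over the intervals of a packet loss pattern.

    samples: list of (time_s, value) pairs, e.g. per-second frame loss rate.
    levels:  list of (start_s, end_s, packet_loss) tuples; each interval
             is half-open, [start_s, end_s).
    Returns a list of (packet_loss, mean_value) pairs; mean_value is None
    when no sample falls inside the interval.
    """
    result = []
    for start, end, loss in levels:
        in_level = [v for t, v in samples if start <= t < end]
        mean = sum(in_level) / len(in_level) if in_level else None
        result.append((loss, mean))
    return result


# Hypothetical no-loss series averaged over two different patterns:
# the same data yields different step curves.
no_loss = [(0, 0.1), (1, 0.3), (2, 0.5), (3, 0.7)]
pattern_a = [(0, 2, 0.00), (2, 4, 0.05)]
pattern_b = [(0, 1, 0.00), (1, 4, 0.20)]
curve_a = average_per_level(no_loss, pattern_a)
curve_b = average_per_level(no_loss, pattern_b)
```

Averaging the same no-loss series over the intervals of two different patterns gives two different step curves, which is why the no-loss graphs in the two figures do not coincide.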

Figure 11. Average frame drop rate for each level of low packet loss pattern

These figures show high similarities between the no-loss and low-loss experiments. However, Figure 11 reveals a considerable difference in frame drop rate per packet loss level. For the first two levels, the stream seems to react negatively to packet loss, dropping frames at the sink. When packet loss reaches 5%, Skype reacts with an FEC mechanism up to 10% packet loss, and the frame drop rate decreases. However, even when packet loss drops to 0%, the frame drop rate still differs considerably from that of the no-loss experiment.

Figure 12. Average frame drop rate for each level of high packet loss pattern

Skype’s inability to quickly regain a low frame drop rate after a high packet loss peak becomes more evident in Figure 12. Even though the packet loss starts dropping after the peak of 20%, the frame drop rate keeps increasing. A slight decrease starts only when packet loss is 1%, after more than 30 seconds of video.

The average SSIM index per packet loss level (Figure 13 and Figure 14) demonstrates the weaknesses of Skype. For instance, despite the high frame drop rate per low packet loss level, the SSIM per level is comparable to that of the no-loss experiment. However, Figure 14 shows that Skype clearly struggles to handle high-loss peaks and, especially, to recover once the packet loss decreases.

Figure 13. Average SSIM per low packet loss level

Figure 14. Average SSIM per high packet loss level

The last packet loss level of Figure 14 shows an improvement of SSIM. Based on Figure 11, this cannot be the result of lower frame drop rate. Thus, Skype prefers to correct the quality of certain frames before injecting new frames in the channel. This is an indication that Skype assumes low motion videos such as typical video calling content.

[Plots for Figures 11 to 14: x-axis in seconds (0-40), y-axis 0%-100%. Series: Avg. frame drop (no loss) against Avg. frame drop (low loss) and Avg. frame drop (high loss); SSIM (no loss) against SSIM (low loss) and SSIM (high loss); each plot also shows the low or high packet loss pattern.]

6. CONCLUSIONS AND FUTURE WORK

With this work, we tried to establish an objective video quality assessment for Skype real-time streams. The aim was to identify how network impairments correlate with video quality in Skype streams. We streamed specific raw video files and impaired the channel with packet loss. The content was recorded as it was rendered [17] by Skype at the sink of the stream. We accurately aligned the original and recorded videos to allow for reliable PSNR and SSIM comparisons.

The results helped us draw two conclusions. First, Skype seems to be more efficient at handling videos with low motion, such as a typical head and shoulders video call. Second, the Skype protocol cannot, or chooses not to, recover fast enough after high packet loss peaks. From the results, we also saw that Skype mechanisms handle packet loss of 0-10% very efficiently. However, packet loss rates over 15% cause serious problems to the delivered quality of the input video used in the experiments above.

Ongoing work and future efforts focus on carrying out more accurate experiments that will strongly correlate a variety of network impairments with Skype video quality. For this purpose, we will test a bigger set of videos with various dynamics. Moreover, the use of progress bars as a frame alignment tool might cast some doubt on the reliability of the results. To this end, we are designing an internal alignment mechanism via frame-by-frame watermarking. We also plan to assess Skype video quality with subjective methods and Adaptive MLDS [18].

7. REFERENCES

[1] I. Avcibas, B. Sankur, and K. Sayood, “Statistical evaluation of image quality measures,” Journal of Electronic Imaging, vol. 11, no. 2, p. 206, Apr. 2002.

[2] Z. Wang, “Video quality assessment based on structural distortion measurement,” Signal Processing: Image Communication, vol. 19, no. 2, pp. 121-132, Feb. 2004.

[3] D. Bonfiglio, M. Mellia, M. Meo, D. Rossi, and P. Tofanelli, “Revealing skype traffic,” ACM SIGCOMM Computer Communication Review, vol. 37, no. 4, p. 37, Oct. 2007.

[4] D. Bonfiglio, M. Mellia, M. Meo, N. Ritacca, and D. Rossi, “Tracking Down Skype Traffic,” in 2008 IEEE INFOCOM - The 27th Conference on Computer Communications, 2008, pp. 261-265.

[5] D. Bonfiglio, M. Mellia, M. Meo, and D. Rossi, “Detailed Analysis of Skype Traffic,” IEEE Transactions on Multimedia, vol. 11, no. 1, pp. 117-127, Jan. 2009.

[6] D. Rossi, M. Mellia, and M. Meo, “A detailed measurement of skype network traffic,” in IPTPS’08 Proceedings of the 7th international conference on Peer-to-peer systems, 2008, p. 12.

[7] T.-Y. Huang, P. Huang, K.-T. Chen, and P.-J. Wang, “Could Skype be more satisfying? a QoE-centric study of the FEC mechanism in an internet-scale VoIP system,” IEEE Network, vol. 24, no. 2, pp. 42-48, Mar. 2010.

[8] L. De Cicco and S. Mascolo, “A Mathematical Model of the Skype VoIP Congestion Control Algorithm,” IEEE Transactions on Automatic Control, vol. 55, no. 3, pp. 790-795, Mar. 2010.

[9] L. De Cicco, S. Mascolo, and V. Palmisano, “Skype video responsiveness to bandwidth variations,” in Proceedings of the 18th International Workshop on Network and Operating Systems Support for Digital Audio and Video - NOSSDAV ’08, 2008, p. 81.

[10] D. Bonfiglio, M. Mellia, M. Meo, D. Rossi, and P. Tofanelli, “Revealing skype traffic,” ACM SIGCOMM Computer Communication Review, vol. 37, no. 4, p. 37, Oct. 2007.

[11] K.-T. Chen, C.-Y. Huang, P. Huang, and C.-L. Lei, “Quantifying Skype user satisfaction,” ACM SIGCOMM Computer Communication Review, vol. 36, no. 4, p. 399, Aug. 2006.

[12] T. Hoßfeld and A. Binzenhöfer, “Analysis of Skype VoIP traffic in UMTS: End-to-end QoS and QoE measurements,” Computer Networks, vol. 52, no. 3, pp. 650-666, Feb. 2008.

[13] P.800.1: Mean Opinion Score (MOS) terminology. 2006, p. 12.

[14] M.-D. Cano and F. Cerdan, “Subjective QoE analysis of VoIP applications in a wireless campus environment,” Telecommunication Systems, pp. 1-11, Jun. 2010.

[15] P.910: Subjective video quality assessment methods for multimedia applications. 2008, p. 42.

[16] J. Zhu, “On traffic characteristics and user experience of Skype video call,” in 2011 IEEE Nineteenth IEEE International Workshop on Quality of Service, 2011, pp. 1-3.

[17] V. Menkovski, G. Exarchakos, A. Liotta, and A. C. Sánchez, “Quality of Experience Models for Multimedia Streaming,” International Journal of Mobile Computing and Multimedia Communications, vol. 2, no. 4, pp. 1-20, Jan. 2010.

[18] V. Menkovski, G. Exarchakos, and A. Liotta, “Tackling the sheer scale of subjective QoE,” in Proceedings of the 7th International ICST Mobile Multimedia Communications Conference, 2011.