Computer Networks: An Open Source Approach
Chapter 5: Transport Layer

Upload: charlotte-watts

Post on 19-Jan-2016


Content

5.1 General Issues: Port-Multiplexing, Reliability, Flow/Congestion Control
5.2 UDP - Unreliable Connectionless Transfer: Port-Multiplexing
5.3 TCP - Reliable Connection-Oriented Transfer: Connection Management, Reliability, Flow Control, Performance Enhancements
5.4 Socket Programming Interface
5.5 Real-time Transport (RTP & RTCP)
5.6 Summary


5.1 General Issues

End-to-End Communication Channel
Data Integrity
Flow Control
Socket Programming Interface

5.1 General Issues: End-to-End Communication Channel and Port-Multiplexing

Port: a communication end point.

[Figure: a node-to-node channel between LAN host 1 and LAN host 2 (MAC over a multi-access channel, condensed delay distribution) next to an end-to-end channel between IP host 1 and IP host 2 (application processes AP1 and AP2 over TCP/UDP and IP across an IP network, loose delay distribution).]

General Issues: Direct-Linked vs. End-to-End

Aspect | Direct-Linked Protocol Layer | End-to-End Protocol Layer
Based on which services? | physical layer | internetworking layer
Addressing | node-to-node channel within a link (MAC address) | process-to-process channel between hosts (port number)
Error checking | per-frame | per-segment
Data reliability | per-link | per-flow
Flow control | per-link | per-flow
Channel delay | condensed distribution | loose distribution

Note on per-frame integrity, using Ethernet as an example:
  Collision: can be detected, and the frame is retransmitted.
  CRC/alignment error: recovery can only rely on upper-layer protocols.

Open Source Implementation 5.1: an incoming packet in the transport layer

Copyright Reserved 2009

[Figure: call graph of an incoming packet, from the network layer up to the application layer. Function names shown, grouped approximately by layer:
  Network layer: ip_local_deliver_finish
  Transport layer, RAW: raw_v4_input(skb), sk=__raw_v4_lookup(skb), raw_rcv(sk,skb), raw_rcv_skb(sk,skb), raw_recvmsg(sk,buf)
  Transport layer, UDP: udp_rcv(skb), sk=udp_v4_lookup(skb), udp_queue_rcv_skb(sk,skb), sock_queue_rcv_skb(sk,skb), udp_recvmsg(sk,buf), skb_recv_datagram
  Transport layer, TCP: tcp_v4_rcv(skb), sk=inet_lookup(skb), tcp_v4_do_rcv(sk,skb), tcp_rcv_established, tcp_data_queue(sk,skb), tcp_recvmsg(sk,buf), skb_copy_datagram_iovec
  Receive queue: __skb_queue_tail(sk->sk_receive_queue, skb), sk->sk_data_ready, sk_receive_queue, skb=skb_dequeue
  Application layer, read path: read, sys_read, vfs_read, do_sync_read, sock_aio_read, do_sock_read, sock_recvmsg
  Application layer, recvfrom path: recvfrom, sys_socketcall, sys_recvfrom, sock_recvmsg, __sock_recvmsg, sock_common_recvmsg]

Open Source Implementation 5.1: an outgoing packet in the transport layer

[Figure: call graph of an outgoing packet, from the application layer down to the network layer. Function names shown, grouped approximately by layer:
  Application layer, write path: write, sys_write, vfs_write, do_sync_write, sock_aio_write, do_sock_write, sock_sendmsg, inet_sendmsg
  Application layer, sendto path: sendto, sys_socketcall, sys_sendto, sock_sendmsg, inet_sendmsg
  Transport layer, RAW and UDP: raw_sendmsg(sk,buf), udp_sendmsg(sk,buf), ip_append_data(sk,buf), skb=sock_alloc_send_skb(sk), skb=sock_wmalloc(sk), ip_generic_getfrag, udp_push_pending_frames
  Transport layer, TCP: tcp_sendmsg(sk,buf), skb_queue_tail(&sk->sk_write_queue,skb), sk_write_queue, tcp_push, __tcp_push_pending_frames, tcp_write_xmit, tcp_transmit_skb
  Network layer: ip_push_pending_frames, ip_queue_xmit, dst_output, skb->dst->output, ip_output]

5.2 UDP – Unreliable Connectionless Transfers

Objectives
Header Format
Unicast Real-time Applications Using UDP

5.2 UDP – For Unreliable Connectionless Transfers

Objectives
  Port-Multiplexing
  Per-Segment Error Checking: Checksum
Carrying Unicast/Multicast Real-Time Traffic
  Retransmission is meaningless: no per-flow integrity needed
  Bit rate is determined by the codec used: no flow control needed

[Figure: application processes AP1 and AP2 on IP host 1 and IP host 2, connected through the transport layer across IP networks.]

Header Format (8 bytes, followed by data if any)
  Bits 0-15 | Bits 16-31
  source port number | destination port number
  UDP length | UDP checksum (optional)
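The 8-byte UDP header layout can be illustrated with a short sketch. The helper names `build_udp_segment` and `parse_udp_segment` are hypothetical, and the code assumes the checksum field is left at 0 (not computed), which the slide marks as optional.

```python
import struct

# Four 16-bit fields in network byte order: source port, destination port,
# UDP length (header + data), and checksum.
UDP_HEADER = struct.Struct("!HHHH")

def build_udp_segment(src_port, dst_port, payload, checksum=0):
    """Prepend the 8-byte UDP header to payload (checksum 0 = not computed)."""
    length = UDP_HEADER.size + len(payload)
    return UDP_HEADER.pack(src_port, dst_port, length, checksum) + payload

def parse_udp_segment(segment):
    """Return (src_port, dst_port, length, checksum, payload)."""
    src_port, dst_port, length, checksum = UDP_HEADER.unpack_from(segment)
    return src_port, dst_port, length, checksum, segment[UDP_HEADER.size:length]

segment = build_udp_segment(5000, 53, b"hello")
print(parse_udp_segment(segment))
```

The destination port is what the receiving host uses for port-multiplexing: it selects the socket whose queue receives the payload.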


Open Source Implementation 5.2: UDP and TCP Checksum

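Checksum in TCP/IP (Linux 2.6)

[Figure: an IP header, a TCP/UDP header (T), and application data (D), annotated with the checksum calls:
  csum = csum_partial(D, lenD, 0) over the application data
  csum = csum_partial(T, lenT, csum) over the TCP/UDP header
  tcp_v4_check(T, lenT, SA, DA, csum) adds the pseudo header (source and destination addresses)
  ip_send_check(iph) computes the IP header checksum]

th->check = tcp_v4_check(len, inet->saddr, inet->daddr,
                         csum_partial((char *)th, th->doff << 2, skb->csum));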
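The checksum steps can be sketched outside the kernel as a 16-bit one's-complement sum over the segment and the pseudo header. This is an illustrative model, not the kernel code: `ones_complement_sum` and `transport_checksum` are hypothetical names, and addresses are assumed to be passed as 4-byte IPv4 values.

```python
import struct

def ones_complement_sum(data, total=0):
    """Add data to a running 16-bit one's-complement sum (plays the role of csum_partial)."""
    if len(data) % 2:
        data += b"\x00"                      # pad odd-length data with a zero byte
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    while total >> 16:                       # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return total

def transport_checksum(src_ip, dst_ip, protocol, segment):
    """Checksum over pseudo header + TCP/UDP header + data.

    protocol is the IP protocol number (6 for TCP, 17 for UDP); the checksum
    field inside segment must be zero when computing a checksum to send.
    """
    pseudo = src_ip + dst_ip + struct.pack("!BBH", 0, protocol, len(segment))
    total = ones_complement_sum(segment)
    total = ones_complement_sum(pseudo, total)
    return ~total & 0xFFFF
```

A receiver can verify a segment by running the same function over the segment as received: with a correct checksum in place, the result is 0.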


5.3 TCP – Reliable Connection-Oriented Transfers

Objectives
Connection Management
Per-Flow Data Integrity
Per-Flow Flow/Congestion Control
Performance Problems and Enhancements

5.3 TCP – For Reliable Connection-Oriented Transfers

Objectives
  Port-Multiplexing: same as UDP
  Per-Flow Reliability
  Per-Flow Flow Control
Connection Management
  Connection Establishment/Disconnection & State Transitions
Per-Flow Data Integrity
  Per-Segment Checksum & Per-Flow ACKs
Per-Flow Flow/Congestion Control
Performance
  Interactive vs. Bulk-Data Transfers

TCP is stateful (see Ch. 1), so it requires connection management.

TCP Connection Management: Establishment/Termination

Establishment (3-way handshake protocol)
  1. client -> server: SYN (seq=x)
  2. server -> client: SYN (seq=y), ACK of SYN (ack=x+1)
  3. client -> server: ACK of SYN (seq=x+1, ack=y+1)

Termination
  1. client -> server: FIN
  2. server -> client: ACK of FIN
  3. server -> client: FIN
  4. client -> server: ACK of FIN

TCP State Transition Diagram

[Figure: TCP state transition diagram. States: CLOSED, LISTEN, SYN_RCVD, SYN_SENT, ESTABLISHED (data transfer state), CLOSE_WAIT, LAST_ACK, FIN_WAIT_1, FIN_WAIT_2, CLOSING, TIME_WAIT. Edges are labelled with the triggering event and the segment sent, for example "app: passive open / send: nothing", "app: active open / send: SYN", "recv: SYN / send: SYN, ACK", "recv: SYN, ACK / send: ACK", "app: close / send: FIN", "recv: FIN / send: ACK", "recv: ACK / send: nothing", "recv: RST", "timeout / send: RST", and "app: close or timeout". The diagram marks the client path (active open, active close), the server path (passive open, passive close), and the simultaneous open and simultaneous close paths.]
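The state diagram can be modelled as a lookup table from (state, event) to (next state, segment sent). The sketch below is a simplified model: the edge endpoints follow the standard TCP state diagram rather than anything recoverable from this transcript, only the main edges are included, and the names `TRANSITIONS` and `run` are hypothetical.

```python
# (current state, event) -> (next state, segment sent or None)
TRANSITIONS = {
    ("CLOSED", "app: passive open"): ("LISTEN", None),
    ("CLOSED", "app: active open"): ("SYN_SENT", "SYN"),
    ("LISTEN", "recv: SYN"): ("SYN_RCVD", "SYN,ACK"),
    ("SYN_SENT", "recv: SYN,ACK"): ("ESTABLISHED", "ACK"),
    ("SYN_SENT", "recv: SYN"): ("SYN_RCVD", "SYN,ACK"),   # simultaneous open
    ("SYN_RCVD", "recv: ACK"): ("ESTABLISHED", None),
    ("ESTABLISHED", "app: close"): ("FIN_WAIT_1", "FIN"),  # active close
    ("ESTABLISHED", "recv: FIN"): ("CLOSE_WAIT", "ACK"),   # passive close
    ("CLOSE_WAIT", "app: close"): ("LAST_ACK", "FIN"),
    ("LAST_ACK", "recv: ACK"): ("CLOSED", None),
    ("FIN_WAIT_1", "recv: ACK"): ("FIN_WAIT_2", None),
    ("FIN_WAIT_1", "recv: FIN"): ("CLOSING", "ACK"),       # simultaneous close
    ("FIN_WAIT_2", "recv: FIN"): ("TIME_WAIT", "ACK"),
    ("CLOSING", "recv: ACK"): ("TIME_WAIT", None),
    ("TIME_WAIT", "2MSL timeout"): ("CLOSED", None),
}

def run(state, events):
    """Apply events in order; return the final state and the segments sent."""
    sent = []
    for event in events:
        state, segment = TRANSITIONS[(state, event)]
        if segment is not None:
            sent.append(segment)
    return state, sent

# A client that opens a connection and then closes it first.
print(run("CLOSED", ["app: active open", "recv: SYN,ACK", "app: close",
                     "recv: ACK", "recv: FIN", "2MSL timeout"]))
```

An event with no entry for the current state raises `KeyError`, which is a convenient way to spot sequences the simplified table does not cover.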

State Transitions: Establishment

[Figure: 3-way handshake with states. Client: CLOSED -> SYN_SENT after sending SYN (seq=x) -> ESTABLISHED after receiving SYN (seq=y), ACK of SYN (ack=x+1) and replying with ACK of SYN (seq=x+1, ack=y+1). Server: LISTEN -> SYN_RCVD after receiving the SYN -> ESTABLISHED after receiving the final ACK.]

State Transitions: Termination

[Figure: connection termination with states. Client: ESTABLISHED -> FIN_WAIT_1 after sending FIN -> FIN_WAIT_2 after receiving ACK of FIN -> TIME_WAIT after receiving the server's FIN and sending ACK of FIN -> CLOSED after the 2MSL timeout. Server: ESTABLISHED -> CLOSE_WAIT after receiving FIN and sending ACK of FIN -> LAST_ACK after sending its own FIN -> CLOSED after receiving ACK of FIN.]

State Transitions: Simultaneous Open/Close

[Figure (a), state transitions in simultaneous open: both ends send SYN (seq=x and seq=y), move from SYN_SENT to SYN_RCVD on receiving the peer's SYN, and reach ESTABLISHED once each SYN is acknowledged (ack=y+1 and ack=x+1).]

[Figure (b), state transitions in simultaneous close: both ends send FIN from ESTABLISHED, move from FIN_WAIT_1 to CLOSING on receiving the peer's FIN, enter TIME_WAIT on receiving ACK of FIN, and reach CLOSED after the 2MSL timeout.]

State Transitions: Loss in Establishment

[Figure (a), SYN sent by the client is lost: the client waits in SYN_SENT until a timeout and returns to CLOSED; the server stays in LISTEN.]

[Figure (b), SYN sent by the server is lost: the client times out in SYN_SENT and returns to CLOSED; the server times out in SYN_RCVD, sends RST, and leaves SYN_RCVD.]

[Figure (c), ACK of SYN sent by the client is lost: the client has already entered ESTABLISHED; the server times out in SYN_RCVD and sends RST, after which the client returns to CLOSED.]

TCP State Transition Implementation

In “sock” structure

State names


Reliability of Data Transfers

Definition: Data Reliability vs. Data Integrity
  Data Integrity: successfully received packets are exactly the same as when they were transmitted.
  Data Reliability: every transmitted packet is successfully received and is exactly the same as the original transmitted one.

TCP
  Per-Segment Integrity: Checksum
  Per-Flow Reliability: Sequence Number & ACK

Per-Flow Data Reliability: Sequence Number & Acknowledgement

ACK every successfully received data segment.
Segment sent but not ACKed?
  Dropped by some intermediate router
    Insufficient buffer
    Forced drop
  Dropped by the receiver
    Wrong checksum
Retransmitting lost packets
  When to retransmit?
  Which packet to retransmit?

Data Reliability: Cumulative ACK

[Figure: four client/server timelines in which the client sends DATA(Seq=100, Len=50) and DATA(Seq=150, Len=30) and the server replies with cumulative ACKs.
  (a) packet loss: a data segment is lost; the client retransmits it after a timeout, and the server then replies ACK(Ack=180).
  (b) delay: an ACK is delayed past the timeout; the client retransmits, and the server drops the duplicate data and replies ACK(Ack=180) again.
  (c) ACK loss: ACK(Ack=150) is lost, but the later ACK(Ack=180) acknowledges both segments.
  (d) out-of-sequence: DATA(Seq=150, Len=30) arrives first and is answered with ACK(Ack=100); when DATA(Seq=100, Len=50) arrives, the server replies ACK(Ack=180).]
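The cumulative ACK behavior in these scenarios can be reproduced with a small receiver model. This is a sketch under simplifying assumptions: the class name `CumulativeAckReceiver` is hypothetical, and segments are assumed never to overlap partially.

```python
class CumulativeAckReceiver:
    """Tracks the next expected byte and buffers out-of-order segments."""

    def __init__(self, next_expected):
        self.next_expected = next_expected
        self.out_of_order = {}            # seq -> length of buffered segments

    def receive(self, seq, length):
        """Process DATA(Seq=seq, Len=length) and return the Ack number to send."""
        if seq + length <= self.next_expected:
            return self.next_expected     # duplicate data: drop it, repeat the ACK
        self.out_of_order[seq] = length
        # Deliver every buffered segment that is now in sequence.
        while self.next_expected in self.out_of_order:
            self.next_expected += self.out_of_order.pop(self.next_expected)
        return self.next_expected

receiver = CumulativeAckReceiver(next_expected=100)
print(receiver.receive(150, 30))   # out of sequence: ACK still points at the hole
print(receiver.receive(100, 50))   # hole filled: one ACK covers both segments
print(receiver.receive(100, 50))   # retransmitted duplicate: dropped and re-ACKed
```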

Pseudo Code of Sliding Window in the Sender

SWS: send window size.
n: current sequence number, i.e., the next packet to be transmitted.
LAR: last acknowledgment received.

if the sender has data to send
    Transmit up to SWS packets ahead of the latest acknowledgment LAR,
    i.e., it may transmit packet number n as long as n < LAR + SWS.
endif

if an ACK arrives
    Set LAR to the ack num if the ack num > LAR.
endif
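The sender pseudo code translates almost line for line into a runnable sketch. The class name `SlidingWindowSender` is hypothetical, and the ack num is assumed to be the number of the next packet the receiver expects, so packets LAR to LAR+SWS-1 fit in the window.

```python
class SlidingWindowSender:
    def __init__(self, sws, first_seq=0):
        self.sws = sws          # SWS: send window size, in packets
        self.lar = first_seq    # LAR: last acknowledgment received
        self.n = first_seq      # n: next packet number to transmit

    def sendable(self):
        """Transmit while n < LAR + SWS; return the packet numbers sent."""
        sent = []
        while self.n < self.lar + self.sws:
            sent.append(self.n)
            self.n += 1
        return sent

    def on_ack(self, ack_num):
        """Slide the window only when the ACK advances past LAR."""
        if ack_num > self.lar:
            self.lar = ack_num

sender = SlidingWindowSender(sws=4, first_seq=2)
print(sender.sendable())   # fills the window
sender.on_ack(4)           # two packets acknowledged
print(sender.sendable())   # window slides by two
```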

Per-Flow Flow/Congestion Control: Sliding Window

[Figure: the sending byte stream is divided into segments that are sent and ACKed, segments inside the TCP window (window size = min(RWND, CWND), marked by SND_UNA and SND_NXT), and segments to be sent when the window moves. DATA 6 to DATA 8 and ACKs (Ack=5, Ack=6) are in the network pipe (size = Data 4~8) between the sender and the receiver, which assembles the receiving byte stream.]

The sender should maintain a retransmission queue in case of packet loss.
The receiver should maintain an out-of-order queue to re-sequence data before returning it to the application.

Sliding Window: Normal Case

[Figure: four snapshots of the sender's window, TCP window size = min(RWND, CWND), sliding as ACKs arrive.
  (a) Original condition: DATA 6, 7, 8 and ACKs (Ack=5, Ack=6) are in the network pipe.
  (b) Sender receives ACK(Ack=5): the window slides and DATA 9 enters the pipe.
  (c) Sender receives ACK(Ack=6): the window slides and DATA 10 enters the pipe.
  (d) Sender receives ACK(Ack=7): the window slides and DATA 11 enters the pipe.]

Sliding Window: Out-of-Sequence

[Figure: four snapshots of the sender's window, TCP window size = min(RWND, CWND), when DATA 4 arrives after later segments.
  (a) Original condition: DATA 4 is still in the network pipe behind later segments, and duplicate ACKs (Ack=4) are returning.
  (b) Sender receives ACK(Ack=4) of DATA 5: the window does not slide.
  (c) Sender receives ACK(Ack=4) of DATA 6: the window does not slide.
  (d) Sender receives ACK(Ack=7): the window slides.]

Per-Flow Flow/Congestion Control: Opening & Shrinking of Window Size

[Figure: the TCP window, size = min(RWND, CWND), over the byte stream, with arrows on its edges labelled Open, Shrink, and Close.]

Retransmitting Lost Packets

Retransmit which packet?
  Fast Retransmit
  Towards better accuracy: TCP SACK option
  Further refinement: FACK (based on SACK)
When to retransmit?
  Fast Retransmit: same as above
  Retransmission Timeout (RTO)
    Round-Trip Time (RTT) measurement
    Tradeoff: RTT vs. RTO
    Karn’s Algorithm
    Towards better RTO: TCP Timestamp option

Retransmit Which Packet?

Fast Retransmit
  Causes of duplicate ACKs
    Packet reordering
    Packet loss
    Internet route change
  TCP receiver
    ACKs the first "hole"
  Triple Duplicate ACKs (TDA)
    4 identical ACKs (ACK field = X)
  TCP sender
    Infers congestion from TDA
    Retransmits the packet with SeqNum = X
    Halves its sending rate

[Figure: timeline at the receiver. DATA 2, 3, 6, 7, 8 arrive and are answered with ACK 3, 4, 4, 4, 4.]
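The receiver timeline and the sender's triple-duplicate-ACK rule can be checked with a short sketch. Both function names are hypothetical, and packets are numbered per packet as on the slide rather than per byte.

```python
def receiver_acks(arrivals, next_expected):
    """ACK the first 'hole' after each arriving packet."""
    received, acks = set(), []
    for seq in arrivals:
        received.add(seq)
        while next_expected in received:
            next_expected += 1
        acks.append(next_expected)
    return acks

def fast_retransmit_targets(acks, threshold=3):
    """Return the packet numbers retransmitted after `threshold` duplicate ACKs."""
    targets, last, duplicates = [], None, 0
    for ack in acks:
        if ack == last:
            duplicates += 1
            if duplicates == threshold:    # the 4th identical ACK
                targets.append(ack)
        else:
            last, duplicates = ack, 0
    return targets

acks = receiver_acks([2, 3, 6, 7, 8], next_expected=2)
print(acks)                            # the ACK row of the figure
print(fast_retransmit_targets(acks))   # packet the sender retransmits
```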

When to Retransmit?

Retransmission TimeOut (RTO)
  Round-Trip Time (RTT) measurement vs. RTO
    RTT varies dramatically
    Smoothed RTT (SRTT): exponential weighted moving average of RTT
    Mdev: mean deviation of RTT
    RTO = SRTT + 4 * Mdev
  Karn’s Algorithm
    Do not update the RTT estimate, and hence the RTO, from a segment that was retransmitted.
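A floating-point sketch of this estimator follows. It is an illustration rather than the Linux code: the gains of 1/8 for SRTT and 1/4 for Mdev and the initial Mdev of half the first sample are assumptions in the style of Jacobson's estimator, and the class name `RttEstimator` is hypothetical.

```python
class RttEstimator:
    def __init__(self, first_sample):
        self.srtt = first_sample          # smoothed RTT
        self.mdev = first_sample / 2      # mean deviation (assumed initial value)

    def rto(self):
        return self.srtt + 4 * self.mdev  # RTO = SRTT + 4 * Mdev

    def update(self, sample, retransmitted=False):
        """Feed one RTT sample; return the resulting RTO."""
        if retransmitted:
            return self.rto()             # Karn: ignore ambiguous samples
        error = sample - self.srtt
        self.srtt += error / 8                       # EWMA with gain 1/8
        self.mdev += (abs(error) - self.mdev) / 4    # EWMA with gain 1/4
        return self.rto()

estimator = RttEstimator(first_sample=100.0)         # milliseconds
print(estimator.update(100.0))
print(estimator.update(400.0, retransmitted=True))   # leaves the estimate untouched
```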

TCP RTT Estimator

Fast estimator by Van Jacobson ’88 & ’90
  srtt (smoothed RTT) is stored as 8 times the RTT
  mdev is stored as 4 times the real mean deviation
In tcp_input.c from Linux 2.6: exponential weighted moving average

Per-Flow Flow/Congestion Control: How Fast to Send?

Fast sender vs. slow receiver
  How to know? The receiver feeds back RWND (receiver advertised window) in its ACKs.
Fast sender vs. congested network
  How to know? The network feeds back loss events; the sender re-adjusts CWND (congestion window).
How fast?
  Satisfy both: min(RWND, CWND)

TCP Tahoe Congestion Control

[Figure: TCP Tahoe state diagram with states slow start, congestion avoidance, fast retransmit, and retransmission timeout.
  start: cwnd=1, enter slow start
  slow start, on ACK: cwnd=cwnd+1, send packet
  slow start -> congestion avoidance: when cwnd ≥ ssth
  congestion avoidance, on ACK: cwnd=cwnd+1/cwnd, send data packet
  -> fast retransmit: on ≥ 3 duplicate ACKs; send missing packet, ssth=cwnd/2, cwnd=1
  -> retransmission timeout: on timeout; cwnd=1
  back to slow start: when all are ACKed]

Slow Start & Congestion Avoidance

[Figure: two source/destination timelines, one for Slow Start and one for Congestion Avoidance.]

An example: TCP Tahoe

(1) cwnd=8, awnd=8. Sender sent segments 31-38.
(2) cwnd=8, awnd=8. Receiver replied with seven duplicate ACKs of segment 30.
(3) cwnd=1, awnd=8. Sender received three duplicate ACKs and cwnd is set to 1 packet. The lost segment 31 is retransmitted. Sender exited fast retransmit and entered the slow start state.
(4) cwnd=1, awnd=8. Receiver replied with the ACK of segment 38 when it received the retransmitted segment 31.
(5) cwnd=1, awnd=1. Sender sent segment 39.

TCP Reno Congestion Control (RFC 2581)

[Figure: TCP Reno state diagram with states slow start, congestion avoidance, fast retransmit, fast recovery, and retransmission timeout.
  start: cwnd=1, enter slow start
  slow start, on ACK: cwnd=cwnd+1, send packet
  slow start -> congestion avoidance: when cwnd ≥ ssth
  congestion avoidance, on ACK: cwnd=cwnd+1/cwnd, send data packet
  -> fast retransmit: on ≥ 3 duplicate ACKs (ACK = x); ssth=cwnd/2, cwnd=ssth, send missing packet
  fast recovery, on each duplicate ACK: cwnd=cwnd+1, send data packet
  fast recovery -> congestion avoidance: on a non-duplicate ACK (ACK > x); cwnd=ssth
  -> retransmission timeout: on timeout; cwnd=1
  back to slow start: when all are ACKed]
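The window updates in the Tahoe and Reno diagrams can be compared side by side with a small sketch. The function names are hypothetical, cwnd and ssth are counted in packets, and Reno's cwnd on entering fast recovery is taken as ssth plus the three duplicate ACKs already received, matching the "(8/2)+3" used in the Reno example.

```python
def on_ack(cwnd, ssth):
    """New ACK outside fast recovery."""
    if cwnd < ssth:
        return cwnd + 1          # slow start: +1 per ACK
    return cwnd + 1 / cwnd       # congestion avoidance: about +1 per RTT

def on_triple_duplicate_ack(cwnd, variant):
    """Return (new_cwnd, new_ssth) when the third duplicate ACK arrives."""
    ssth = cwnd // 2
    if variant == "tahoe":
        return 1, ssth           # Tahoe: retransmit, restart from slow start
    return ssth + 3, ssth        # Reno: cwnd=ssth, plus one per duplicate ACK

def on_recovery_ack(ssth):
    """Reno leaves fast recovery on a non-duplicate ACK: cwnd=ssth."""
    return ssth

print(on_triple_duplicate_ack(8, "tahoe"))
print(on_triple_duplicate_ack(8, "reno"))
```

With cwnd=8, Tahoe drops to one packet while Reno continues from 7 and then settles at ssth=4, which is why Reno recovers from a single loss without emptying the pipe.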

An example: TCP Reno

(3) cwnd=7, awnd=8. Sender received three duplicate ACKs and cwnd is set to (8/2)+3 packets. The lost segment 31 is retransmitted. Sender exited fast retransmit and entered the fast recovery state.
(4) cwnd=11, awnd=8->11. Receiver replied with the ACK of segment 38 when it received the retransmitted segment 31.
(5) cwnd=4, awnd=3->4. Sender exited fast recovery and entered the congestion avoidance state. cwnd is set to 4 segments.

Open Source Implementation 5.4: TCP Slow Start and Congestion Avoidance

    if (tp->snd_cwnd <= tp->snd_ssthresh) {
        /* Slow start: grow cwnd by one for every ACK */
        if (tp->snd_cwnd < tp->snd_cwnd_clamp)
            tp->snd_cwnd++;
    } else {
        if (tp->snd_cwnd_cnt >= tp->snd_cwnd) {
            /* Congestion avoidance: grow cwnd by one after cwnd ACKs */
            if (tp->snd_cwnd < tp->snd_cwnd_clamp)
                tp->snd_cwnd++;
            tp->snd_cwnd_cnt = 0;
        } else
            tp->snd_cwnd_cnt++;
    }

Principle in Action: TCP Congestion Control Behaviors

[Figure: plot of TCP congestion control behavior, annotated with slow-start, congestion avoidance, triple-duplicate ACKs, fast retransmit, fast recovery, pipe limit, and ssth reset.]

TCP Header Format

Fixed header: 20 bytes (bits 0-31 per row), followed by options (if any) and data.
  source port number (16 bits) | destination port number (16 bits)
  32-bit sequence number
  32-bit acknowledgement number
  header length (4 bits) | 6-bit reserved | flags U A P R S F (6 bits) | window size (16 bits)
  TCP checksum (16 bits) | urgent pointer (16 bits)
  options (if any)
  data
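The fixed 20-byte header can be decoded with `struct`, as in the following sketch. The function name `parse_tcp_header` and the returned dictionary keys are hypothetical; the flag bits follow the U, A, P, R, S, F order of the figure, with F (FIN) as the lowest bit.

```python
import struct

# Ports, sequence number, acknowledgement number, header length + flags,
# window size, checksum, urgent pointer: 20 bytes in network byte order.
TCP_FIXED = struct.Struct("!HHIIHHHH")
FLAG_NAMES = ("FIN", "SYN", "RST", "PSH", "ACK", "URG")   # bit 0 .. bit 5

def parse_tcp_header(segment):
    (src, dst, seq, ack, len_flags,
     window, checksum, urgent) = TCP_FIXED.unpack_from(segment)
    header_len = (len_flags >> 12) * 4     # 4-bit length counts 32-bit words
    flags = {name for bit, name in enumerate(FLAG_NAMES) if len_flags & (1 << bit)}
    return {
        "src": src, "dst": dst, "seq": seq, "ack": ack,
        "header_len": header_len, "flags": flags, "window": window,
        "checksum": checksum, "urgent": urgent,
        "options": segment[TCP_FIXED.size:header_len],
        "data": segment[header_len:],
    }

syn = TCP_FIXED.pack(1234, 80, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
print(parse_tcp_header(syn)["flags"])
```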

TCP Options

Option | kind | len | Value and purpose
End of option list | 0 | (none) | no value; marks the end of the options, as the name suggests
No operation | 1 | (none) | no value; pads fields to a multiple of 4 bytes
Maximum segment size (MSS) | 2 | 4 | 2-byte MSS; negotiates the maximum segment size at the 3-way handshake
Window scale factor | 3 | 3 | 1-byte shift count
Timestamp | 8 | 10 | 4-byte timestamp value and 4-byte timestamp echo reply

The kind and len fields are 1 byte each.

TCP Options: Window Scale Factor (RFC 1323)

Issue: the window is too small in Gigabit networks, which limits throughput.
Solution: negotiate a shift factor for the window.
  Negotiated during the 3-way handshake: SYN with the option, then SYN+ACK with the option.
  Shift by up to 14 bits (from 2^16 to 2^16 x 2^14).
When this option is not used:
  Linux does not advertise a window over 2^15, to avoid problems with other stacks that treat the field as signed (include/net/tcp.h).

Window scale factor: kind=3 (1 byte) | len=3 (1 byte) | shift count (1 byte)
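The effect of the shift count is simple arithmetic, sketched below. The function name `effective_window` is hypothetical; the 14-bit limit comes from the slide.

```python
MAX_SHIFT = 14   # largest shift count allowed for the window scale option

def effective_window(advertised, shift):
    """Window in bytes implied by the 16-bit advertised value and the shift count."""
    if not 0 <= advertised <= 0xFFFF:
        raise ValueError("advertised window must fit in 16 bits")
    if not 0 <= shift <= MAX_SHIFT:
        raise ValueError("shift count must be between 0 and 14")
    return advertised << shift

print(effective_window(0xFFFF, 0))    # without scaling: just under 64 KB
print(effective_window(0xFFFF, 14))   # with the maximum shift: just under 1 GB
```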

TCP Options: Timestamp

Mission 1: improving RTT measurement
  Receiver: copies the timestamp and echoes it in its reply
    Delayed ACK
  Sender: always updates the RTT when it sees an echoed timestamp
Mission 2: protecting against wrapped sequence numbers
  Avoids receiving old segments in high-speed networks
How to enable the timestamp option?
  During the 3-way handshake: timestamp in the SYN, timestamp in its ACK

Timestamp: kind=8 (1 byte) | len=10 (1 byte) | timestamp value (4 bytes) | timestamp echo reply (4 bytes)
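All five options share a kind/len/value layout, so the options area of a header can be walked with one loop. The sketch below is illustrative: `parse_tcp_options` and the returned tuple labels are hypothetical names, and unknown kinds are returned as raw bytes.

```python
import struct

def parse_tcp_options(raw):
    """Decode the options bytes that follow the fixed 20-byte TCP header."""
    options, i = [], 0
    while i < len(raw):
        kind = raw[i]
        if kind == 0:                 # end of option list
            break
        if kind == 1:                 # no operation: a single padding byte
            i += 1
            continue
        if i + 1 >= len(raw):
            raise ValueError("truncated option")
        length = raw[i + 1]           # len covers kind, len, and value
        if length < 2 or i + length > len(raw):
            raise ValueError("bad option length")
        value = raw[i + 2:i + length]
        if kind == 2 and length == 4:
            options.append(("mss", struct.unpack("!H", value)[0]))
        elif kind == 3 and length == 3:
            options.append(("window_scale", value[0]))
        elif kind == 8 and length == 10:
            options.append(("timestamp", struct.unpack("!II", value)))
        else:
            options.append((kind, value))
        i += length
    return options

raw = bytes([2, 4, 0x05, 0xB4, 1, 3, 3, 7, 8, 10]) + struct.pack("!II", 1, 2) + bytes([0])
print(parse_tcp_options(raw))
```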

TCP Timer Management in Linux

Retransmit Timer: to start retransmitting
Persist Timer: to prevent deadlocks
Keepalive Timer (non-standard): to clean up redundant TCP states

Functions of All TCP Timers

Name | Function
connection timer | To establish a new TCP connection, a SYN segment is sent. If no response to the SYN segment is received within the connection timeout, the connection is aborted.
retransmission timer | TCP retransmits the data if the data is not acknowledged before this timer expires.
delayed ACK timer | The receiver must wait until the delayed ACK timeout to send the ACK. If it has data to send during this period, it piggybacks the ACK on that data.
persist timer | A deadlock problem is solved by the sender sending periodic probes after the persist timer expires.
keepalive timer | If the connection is idle for a few hours, the keepalive timeout expires and TCP sends probes. If no response is received, TCP assumes that the other end has crashed.
FIN_WAIT_2 timer | This timer avoids leaving a connection in the FIN_WAIT_2 state forever if the other end has crashed.
TIME_WAIT timer | This timer is used in the TIME_WAIT state to enter the CLOSED state.

Open Source Implementation 5.5: TCP Retransmit Timer

Approximating RTT
  Linux provides good retransmit timer granularity, just like its other timers.
  BSD-derived UNIXes have coarse granularity, to minimize timer overhead:
    They check whether segments are ACKed every 500 ms.
    RTT is over-estimated.
    RTO is then over-estimated.
    Retransmission is slow when a loss is not recovered by Fast Retransmit.

Open Source Implementation 5.6: TCP Persistent (or Probe) Timer

When RWND=0 and the next window update (RWND>0) is lost, deadlock occurs:
  Sender waits for RWND>0 (window update).
  Receiver waits for new data.
Solution
  Sender emits one byte of data to probe.
  Persist timer: uses the RTO with binary exponential backoff, up to 120 seconds.
  Source: tcp_output.c (Linux 2.6)
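The probe schedule implied by "RTO with binary exponential backoff until 120 seconds" can be tabulated with a few lines. The function name `persist_probe_intervals` and the starting RTO of 3 seconds are assumptions for illustration, not values read from the kernel source.

```python
def persist_probe_intervals(rto, probes, cap=120.0):
    """Waiting time before each successive window probe, in seconds."""
    intervals = []
    for _ in range(probes):
        intervals.append(min(rto, cap))   # never wait longer than the cap
        rto *= 2                          # binary exponential backoff
    return intervals

print(persist_probe_intervals(3.0, 7))
```

Once the interval reaches the cap, the sender keeps probing at that fixed spacing until the receiver advertises a non-zero window.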

Open Source Implementation 5.6 (cont.): TCP Keepalive Timer (Non-standard)

When no data has been exchanged for a long time:
  Connection timeout? That belongs to the application’s preference.
  Is the other end dead?
Linux 2.6 implementation (tcp_timer.c)
  Calls tcp_keepalive_timer() every 75 seconds.
    Initialized by the af_inet init routine.
    Searches every established TCP connection.
  If the other end is dead and has not rebooted: state is cleared after 5 probes.
  If the other end is dead and has rebooted: state is cleared after getting an RST.

Data Structures of TCP Connections in Linux

Important variables: include/net/sock.h

Summary: Properties of TCP

Per-Flow Reliability Through ACKs
Window-based Flow Control
Self-clocking using ACKs

TCP Performance Problems and Solutions

Transmission Style | Problem | Solution
Interactive connection | Silly window syndrome (SWS) | Nagle, Clark
Bulk-data transfer | ACK compression phenomenon | Zhang (possible solution: paced TCP sender)
Bulk-data transfer | Reno’s MPL* problem | NewReno, SACK, FACK

*MPL stands for Multiple-Packet-Loss

Performance of Interactive Connections: Problems & Solutions

Problem: Silly Window Syndrome (SWS)
  Sender transmits small packets.
  Receiver advertises small windows.
Solution
  Sender sends whenever any of the following holds:
    Data has accumulated to a full-sized segment.
    Data has accumulated to ½ RWND.
    Nagle’s Algorithm is disabled or not applied.
  Receiver advertises a window whenever either of the following holds:
    Buffer space is available for a full-sized segment.
    Buffer space is available for ½ of its buffer space.
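The sender and receiver rules for avoiding SWS can be written as two predicates. This is a sketch: the function and parameter names are hypothetical, sizes are in bytes, and advertising 0 when neither receiver condition holds is an assumption about what "not advertising" means.

```python
def sender_may_send(queued, mss, rwnd, nagle_enabled=True):
    """Sender-side rule: send only when one of the listed conditions holds."""
    return (queued >= mss             # a full-sized segment has accumulated
            or queued >= rwnd / 2     # data has accumulated to half of RWND
            or not nagle_enabled)     # Nagle's algorithm disabled

def window_to_advertise(free_buffer, mss, buffer_size):
    """Receiver-side rule: advertise a window only when it is worth using."""
    if free_buffer >= mss or free_buffer >= buffer_size / 2:
        return free_buffer
    return 0                          # assumed: keep the window closed for now

print(sender_may_send(queued=100, mss=1460, rwnd=4000))
print(window_to_advertise(free_buffer=30, mss=100, buffer_size=320))
```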

SWS: Receiver Advertises Small Window

[Figure: client/server timeline showing the advertised window shrinking.
  1. Client sends Data(Seq=1, Len=320) while RWND = 320.
  2. Server receives the segment, sends ACK(Ack=321, RWND=80), and reduces the window to 80.
  3. Client sends Data(Seq=321, Len=80).
  4. Server receives the segment, sends ACK(Ack=401, RWND=40), and reduces the window to 40.
  5. Client sends Data(Seq=401, Len=40).
  6. Server receives the segment, sends ACK(Ack=441, RWND=30), and reduces the window to 30.
  Buffer levels shown beside the timeline: 240/320, 220/320, 200/320, 60/80, 30/40.]

Performance Enhancement of Interactive Connections: Nagle’s Algorithm

Goal: utilize the bandwidth resource efficiently.
TELNET: typing speed vs. available bandwidth
  When RTT is short (bandwidth may be sufficient)
    Inter-character spacing > RTT
    Only one outstanding packet per RTT => efficient
  When RTT is large (bandwidth may be insufficient)
    Inter-character spacing < RTT
    Multiple single-character packets per RTT => inefficient
    Nagle: don’t send a small packet until the pipe is clean (keep only one packet in the pipe) => efficient
The beauty of Nagle
  When RTT is short, Nagle’s Algorithm is rarely used.
  When RTT is large, Nagle’s Algorithm is often used.
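Nagle's rule reduces to one decision per write, sketched below. The function name `nagle_may_send` is hypothetical, and "the pipe is clean" is modelled as having no unacknowledged bytes outstanding.

```python
def nagle_may_send(queued_bytes, mss, unacked_bytes):
    """Send now if a full segment is ready or nothing is still in the pipe."""
    return queued_bytes >= mss or unacked_bytes == 0

# Short RTT: the previous character is already ACKed when the next is typed.
print(nagle_may_send(queued_bytes=1, mss=1460, unacked_bytes=0))
# Long RTT: an earlier character is still unacknowledged, so keep buffering.
print(nagle_may_send(queued_bytes=1, mss=1460, unacked_bytes=1))
```

This shows why the algorithm adapts on its own: with a short RTT the second condition is almost always true and nothing is delayed, while with a long RTT characters accumulate and leave together when the ACK arrives.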

Performance of Bulk Data Transfers

Computing the performance through the Bandwidth Delay Product (BDP)
  Horizontal dimension: delay
  Vertical dimension: bandwidth
  Shaded area: packet size
  BDP = pipe size = Bandwidth x RTT
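A worked BDP calculation, using integer arithmetic to avoid rounding surprises. The function names and the example link (100 Mb/s, 80 ms RTT, 1460-byte segments) are illustrative assumptions.

```python
def bdp_bytes(bandwidth_bps, rtt_ms):
    """Pipe size in bytes: bandwidth (bits/s) x RTT (ms), converted to bytes."""
    return bandwidth_bps * rtt_ms // (8 * 1000)

def segments_to_fill_pipe(bdp, segment_size):
    """Number of full-sized segments needed in flight to keep the pipe full."""
    return -(-bdp // segment_size)    # ceiling division

pipe = bdp_bytes(100_000_000, 80)
print(pipe, segments_to_fill_pipe(pipe, 1460))
```

The result is far above the 65535 bytes a 16-bit window can describe, which is the situation the window scale option addresses.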

Performance of Bulk Data Transfers: Filling the Network Pipe

Highest performance: the pipe is full.

[Figure: a WAN pipe between the TCP sender and the TCP receiver, with one pipe for sending data packets and one pipe for replying ACKs.]

Steps of filling the pipe using Congestion Avoidance

[Figure: snapshots (1) to (36) of the pipe, arranged in six rows alongside the labels cwnd=1 to cwnd=6.]

Page 66: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Performance of Bulk Data Transfers
  Modeling TCP Throughput

Given RTT, segment size s, loss rate p:

$$T(t_{RTT}, s, p) = \frac{c \cdot s}{t_{RTT} \sqrt{p}}$$

where c is a constant value.

Given additional information: Max Window Size W_m, # delayed ACK b, RTO:

$$T(t_{RTT}, t_{RTO}, s, p) = \min\left( \frac{W_m \cdot s}{t_{RTT}},\ \frac{s}{t_{RTT}\sqrt{\frac{2bp}{3}} + t_{RTO} \min\left(1,\ 3\sqrt{\frac{3bp}{8}}\right) p \left(1 + 32p^2\right)} \right)$$
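A rough worked example of the first model follows. The values are illustrative assumptions, not taken from the slides: c = sqrt(3/2), about 1.22 (a commonly quoted value for this constant), s = 1460 bytes, RTT = 100 ms, and p = 1%.

```latex
\[
T = \frac{c \cdot s}{t_{RTT}\sqrt{p}}
  \approx \frac{1.22 \times 1460\ \text{bytes}}{0.1\ \text{s} \times \sqrt{0.01}}
  = \frac{1781.2\ \text{bytes}}{0.01\ \text{s}}
  \approx 178\ \text{kB/s} \approx 1.42\ \text{Mbit/s}
\]
```

Because throughput scales with 1/sqrt(p), raising the loss rate from 1% to 4% would halve this estimate.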

Page 67: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Problem of TCP Bulk Data Transfers: ACK-Compression Phenomenon
  Bursty traffic when
    Simultaneous 2-Way Traffic
    Asymmetric Path
  No general solution
    Distributing a window of packets across an RTT may alleviate the phenomenon
[Figure: Sender and Receiver connected through a slow link; ACKs have proper spacing]

Page 68: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Historical Evolution: Multiple-Packet-Loss Recovery in NewReno, SACK, FACK and Vegas
  Solution (I) to TCP Reno’s Problem: TCP NewReno
  Solution (II) to TCP Reno’s Problem: TCP SACK
  Solution (III) to TCP Reno’s Problem: TCP FACK
  Solution (IV) to TCP Reno’s Problem: TCP Vegas

Page 69: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Problem of TCP Bulk Data Transfers: Reno’s Multiple-Packet-Loss Problem (1/2)
[Figure: three snapshots of sender S and receiver D]
  (1) cwnd=8, awnd=8: Sender sent segments 31-38.
  (2) cwnd=8, awnd=8: Receiver replied five duplicate ACKs of segment 30.
  (3) cwnd=7, awnd=8: Sender received three duplicate ACKs and cwnd is changed to (8/2)+3 packets. The lost segment 31 is retransmitted.

Page 70: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Problem of TCP Bulk Data Transfers: Reno’s Multiple-Packet-Loss Problem (2/2)
[Figure: three snapshots of sender S and receiver D]
  (4) cwnd=9, awnd=8->9: Receiver replied the ACK of segment 32 when it received the retransmitted segment 31. This is a partial ACK.
  (5) cwnd=4, awnd=7: Sender exited fast recovery and entered the congestion avoidance state when receiving the partial ACK. Cwnd is changed to 4 segments.
  (6) cwnd=4, awnd=7: Sender waited until timeout.
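The cwnd values in steps (3) to (5) follow from Reno's fast retransmit and fast recovery rules. The sketch below reproduces them in units of segments; it is a simplified model with hypothetical names (`struct reno`, `reno_on_dupack`, `reno_on_new_ack`), not an excerpt from any TCP stack.

```c
/* Simplified Reno sender state, counted in segments. */
struct reno {
    unsigned cwnd;
    unsigned ssthresh;
    unsigned dupacks;
    int in_recovery;
};

/* Handle one duplicate ACK. Returns 1 when the caller should
 * retransmit the first unacknowledged segment (fast retransmit). */
int reno_on_dupack(struct reno *s)
{
    s->dupacks++;
    if (!s->in_recovery && s->dupacks == 3) {
        s->ssthresh = s->cwnd / 2;      /* halve the window */
        s->cwnd = s->ssthresh + 3;      /* three segments have left the pipe */
        s->in_recovery = 1;
        return 1;
    }
    if (s->in_recovery)
        s->cwnd++;                      /* inflate for each further dup ACK */
    return 0;
}

/* Handle an ACK for new data. Reno leaves fast recovery here even if
 * the ACK is only partial, which is the root of the problem above. */
void reno_on_new_ack(struct reno *s)
{
    if (s->in_recovery) {
        s->cwnd = s->ssthresh;          /* deflate the window */
        s->in_recovery = 0;
    }
    s->dupacks = 0;
}
```

Starting at cwnd=8, three duplicate ACKs give cwnd=(8/2)+3=7, two more give 9, and the partial ACK drops cwnd to 4 while segments 33 and 34 are still missing, so the sender can only wait for the timeout.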

Page 71: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Eliminating MPL Problem (I): TCP NewReno (1/3)
  RFC 2582: Extending Fast-Recovery Phase
    Remain in Fast-Recovery until all data in the pipe before detecting 3-Dup ACK are ACKed
[Figure: three snapshots of sender S and receiver D]
  (1) cwnd=8, awnd=8: Sender sent segments 31-38.
  (2) cwnd=8, awnd=8: Receiver replied five duplicate ACKs of segment 30.
  (3) cwnd=7, awnd=8: Sender received three duplicate ACKs and cwnd is changed to (8/2)+3 packets. The lost segment 31 is retransmitted.

Page 72: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Eliminating MPL Problem (I): TCP NewReno (2/3)
[Figure: four snapshots of sender S and receiver D]
  (4) cwnd=9, awnd=8->9: Receiver replied the ACK of segment 32 when it received the retransmitted segment 31. This is a partial ACK.
  (5) cwnd=8, awnd=7->8: Sender received a partial ACK of segment 32 and immediately retransmitted the lost segment 33. Cwnd is changed to 9-2+1.
  (6) cwnd=9, awnd=8->9: Sender received a duplicate ACK and increased cwnd by 1, so segment 41 was sent out. Receiver replied a partial ACK and one duplicate ACK of segment 33.
  (7) cwnd=9, awnd=9->8: The partial ACK triggered the sender to retransmit segment 34 and shrink awnd to 8 (41-33). Receiver replied an ACK of segment 33 upon receiving segment 34.

Page 73: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Eliminating MPL Problem (I): TCP NewReno (3/3)
[Figure: four snapshots of sender S and receiver D]
  (8) cwnd=10, awnd=9->10: Upon receiving the duplicate ACK of segment 33, cwnd was advanced by one. Since awnd was smaller than cwnd, two new segments were sent.
  (9) cwnd=11, awnd=10->11: On receiving the duplicate ACK of segment 33, cwnd was advanced by one and thus segment 44 was sent out.
  (10) cwnd=11, awnd=10->11: Receiver replied ACKs of segments 40, 42, 43, and 44.
  (11) cwnd=4, awnd=4: Sender exited fast recovery upon receiving the ACK of segment 40. Cwnd and awnd were reset to 4.
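NewReno's handling of partial ACKs in the walk-through above can be sketched as follows. The model counts in segments and uses the slides' convention that "ACK of segment n" means everything up to and including n has arrived. `struct newreno`, `newreno_on_ack`, and the `recover` field (the highest segment outstanding when fast recovery began) are names assumed for this illustration.

```c
/* Simplified NewReno state during fast recovery, counted in segments. */
struct newreno {
    unsigned cwnd;
    unsigned ssthresh;
    unsigned snd_una;    /* first segment not yet acknowledged */
    unsigned recover;    /* highest segment sent before entering recovery */
    int in_recovery;
};

/* Process an ACK covering everything up to and including segment `ack`.
 * Returns the segment number to retransmit, or 0 if none. */
unsigned newreno_on_ack(struct newreno *s, unsigned ack)
{
    unsigned newly_acked;

    if (!s->in_recovery || ack < s->snd_una)
        return 0;                        /* not in recovery, or old ACK */

    newly_acked = ack - s->snd_una + 1;
    s->snd_una = ack + 1;

    if (ack >= s->recover) {             /* full ACK: leave fast recovery */
        s->cwnd = s->ssthresh;
        s->in_recovery = 0;
        return 0;
    }

    /* Partial ACK: deflate by the data acknowledged, add one back,
     * stay in fast recovery, and retransmit the next hole. */
    s->cwnd = (s->cwnd > newly_acked ? s->cwnd - newly_acked : 0) + 1;
    return s->snd_una;
}
```

With cwnd=9 and segments 31-38 outstanding, the partial ACK of segment 32 yields cwnd=9-2+1=8 and a retransmission of segment 33, matching step (5). An ACK at or beyond segment 38 ends fast recovery with cwnd=4.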

Page 74: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Eliminating MPL Problem (II): TCP SACK (1/2)
  Reporting non-contiguous blocks of data
[Figure: three snapshots of sender S and receiver D]
  (1) cwnd=8, awnd=8: Sender received ACK of segment 30 and sent segments 31-38.
  (2) cwnd=8, awnd=8: Receiver sent five duplicate ACKs of segment 30 with SACK options.
  (3) cwnd=4, awnd=6: Sender received duplicate ACKs and began retransmitting the lost segments reported in the SACK options. Awnd was set to 8-3+1 (three duplicate ACKs and one retransmitted segment).

SACK options:
  Duplicate ACK 1: (32,32; 0, 0; 0, 0)
  Duplicate ACK 2: (35,35;32,32; 0, 0)
  Duplicate ACK 3: (35,36;32,32; 0, 0)
  Duplicate ACK 4: (35,37;32,32; 0, 0)
  Duplicate ACK 5: (35,38;32,32; 0, 0)
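To show how a sender can read the SACK options listed above, the sketch below derives the missing segments from a cumulative ACK plus up to three SACK blocks. Blocks are inclusive segment-number ranges as on the slide, with (0,0) meaning unused. `struct sack_block` and `sack_find_holes` are hypothetical names; real TCP SACK blocks carry byte sequence numbers rather than segment numbers.

```c
#include <stddef.h>

/* One SACK block as written on the slide: inclusive segment numbers. */
struct sack_block {
    unsigned left;
    unsigned right;
};

/* List the segments above cum_ack that no SACK block covers, up to the
 * highest SACKed segment. Returns the number of holes written. */
size_t sack_find_holes(unsigned cum_ack, const struct sack_block *blk,
                       size_t nblk, unsigned *holes, size_t max_holes)
{
    unsigned highest = cum_ack;
    size_t n = 0;

    for (size_t i = 0; i < nblk; i++)
        if (blk[i].right > highest)
            highest = blk[i].right;

    for (unsigned seg = cum_ack + 1; seg < highest; seg++) {
        int sacked = 0;
        for (size_t i = 0; i < nblk; i++)
            if (blk[i].left <= seg && seg <= blk[i].right)
                sacked = 1;
        if (!sacked && n < max_holes)
            holes[n++] = seg;
    }
    return n;
}
```

For the fifth duplicate ACK, a cumulative ACK of 30 with blocks (35,38) and (32,32) reports segments 31, 33, and 34 as missing, which are the segments the sender retransmits.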

Page 75: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Eliminating MPL Problem (II): TCP SACK (2/2)
[Figure: four snapshots of sender S and receiver D]
  (4) cwnd=4, awnd=4: Receiver replied partial ACKs for received retransmitted segments.
  (5) cwnd=4, awnd=2->4: Sender received partial ACKs, reduced awnd by two, and thus retransmitted two lost segments.
  (6) cwnd=4, awnd=4: Receiver replied ACKs for received retransmitted segments.
  (7) cwnd=4, awnd=4: Sender exited fast recovery after receiving ACK of segment 38.

Page 76: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Eliminating MPL Problem (III): TCP FACK (1/2)
  Extension of SACK, better estimation of awnd
[Figure: three snapshots of sender S and receiver D]
  (1) cwnd=8, awnd=8: Sender received ACK of segment 30 and sent segments 31-38.
  (2) cwnd=8, awnd=8: Receiver sent five duplicate ACKs of segment 30 with SACK options.
  (3) cwnd=4, awnd=4: Sender received two duplicate ACKs and began retransmitting the lost segments reported in the SACK options.

SACK options:
  Duplicate ACK 1: (32,32; 0, 0; 0, 0)
  Duplicate ACK 2: (35,35;32,32; 0, 0)
  Duplicate ACK 3: (35,36;32,32; 0, 0)
  Duplicate ACK 4: (35,37;32,32; 0, 0)
  Duplicate ACK 5: (35,38;32,32; 0, 0)

Page 77: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Eliminating MPL Problem (III): TCP FACK (2/2)
[Figure: three snapshots of sender S and receiver D]
  (4) cwnd=4, awnd=4: Sender calculated awnd for received duplicate ACKs and kept sending the packets allowed.
  (5) cwnd=4, awnd=4: Receiver replied ACKs.
  (6) cwnd=4, awnd=4: Sender exited fast recovery after receiving ACK of segment 38.
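The "better estimation of awnd" in FACK can be sketched as below. The sketch assumes the commonly cited FACK definition, awnd = snd.nxt - snd.fack + retran_data, and a recovery trigger based on the gap between the forward-most SACKed data and the cumulative ACK; neither formula appears on the slides. Everything is counted in segments, and `struct fack_state` and both function names are hypothetical.

```c
/* Simplified FACK sender state, counted in segments. */
struct fack_state {
    unsigned snd_una;      /* first unacknowledged segment */
    unsigned snd_nxt;      /* next new segment to send */
    unsigned snd_fack;     /* one past the forward-most SACKed segment */
    unsigned retran_data;  /* retransmitted segments still outstanding */
};

/* Estimated number of segments actually in the network. */
unsigned fack_awnd(const struct fack_state *s)
{
    return s->snd_nxt - s->snd_fack + s->retran_data;
}

/* Recovery starts once more than three segments lie between the
 * cumulative ACK point and the forward-most SACKed data, or after the
 * usual three duplicate ACKs. */
int fack_should_recover(const struct fack_state *s, unsigned dupacks)
{
    return (s->snd_fack - s->snd_una > 3) || dupacks == 3;
}
```

Under these assumptions the second duplicate ACK, carrying block (35,35), sets snd_fack to 36 while snd_una is 31, so recovery can begin after only two duplicate ACKs. With snd_nxt at 39, awnd is 3 before the retransmission of segment 31 and 4 afterwards, in line with step (3) of the walk-through.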

Page 78: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Performance of Bulk Data Transfers

What have you observed?

When RTTs are heterogeneous……

Page 79: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

5.4 Socket Programming Interface
  Programming Interface to Protocol Layers in Linux
    Accessing End-to-End Protocol Layer
    Accessing Internetworking Protocol Layer
    Accessing Direct-Linked Protocol Layer
  Packet Capturing & Filtering

Page 80: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

5.4 Socket Programming Interface
  Issue: programming interface to protocol layers
  Socket interface in Linux 2.2.17 kernel

  Layer (top to bottom)              Kernel source
  User-space
    Application
    Socket Library
  ------------ Socket interface ------------
  Kernel-space
    BSD Socket                       net/socket.c
    INET Socket                      net/ipv4/af_inet.c
    TCP/UDP                          net/ipv4/{tcp*,udp*}
    IP (with ICMP, ARP, …)           net/ipv4/{ip*,icmp*}
    Ethernet-header builder          net/ethernet/eth.c
    Ethernet NIC driver              drivers/net/*.{c,h}

Page 81: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Bridging Applications & End-to-End Protocols
  socket(domain, type, protocol)
    domain: AF_INET (INET domain)
    type
      UDP: SOCK_DGRAM
      TCP: SOCK_STREAM
    protocol: 0 (the default protocol for the chosen type)
  Typical Applications: telnet, ftp, HTTP
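A minimal sketch of the socket() call described above, opening one TCP and one UDP socket in the INET domain. The helper names are made up for this example; `socket_type` reads the type back with the standard `SO_TYPE` option to confirm what was created.

```c
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* TCP: stream socket in the INET domain, default protocol. */
int open_tcp_socket(void)
{
    return socket(AF_INET, SOCK_STREAM, 0);
}

/* UDP: datagram socket in the INET domain, default protocol. */
int open_udp_socket(void)
{
    return socket(AF_INET, SOCK_DGRAM, 0);
}

/* Ask the kernel which type a descriptor has; returns -1 on error. */
int socket_type(int fd)
{
    int type = -1;
    socklen_t len = sizeof type;

    if (getsockopt(fd, SOL_SOCKET, SO_TYPE, &type, &len) == -1)
        return -1;
    return type;
}
```

Passing 0 as the protocol lets the kernel select TCP for SOCK_STREAM and UDP for SOCK_DGRAM.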

Page 82: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Elementary Socket: TCP Client/Server

  TCP Server
    socket()    obtain a descriptor
    bind()      assign IP & port to the socket
    listen()    1. switch to passive socket 2. create connection queue
    accept()    blocks until connection from client; enter ESTABLISHED state
    read()      data (request)
                process request
    write()     data (reply)
    read()      end-of-file notification
    close()

  TCP Client
    socket()    obtain a descriptor
    connect()   initiate 3-way handshake; connection establishment (TCP three-way handshake)
    write()     data (request)
    read()      data (reply)
    close()     end-of-file notification
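The call sequence above can be exercised in one process over the loopback interface. This is a sketch under simplifying assumptions: the server echoes the request instead of processing it, the message is short enough to arrive in a single read(), and `tcp_roundtrip` is a name invented for this example.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Plays both server and client over loopback.
 * Returns the reply length, or -1 on error. cap must be at least 1. */
int tcp_roundtrip(const char *msg, char *reply, size_t cap)
{
    struct sockaddr_in addr;
    socklen_t alen = sizeof addr;
    char buf[256];
    ssize_t n = -1;
    int lfd = -1, cfd = -1, sfd = -1;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                   /* let the kernel pick a free port */

    /* Server: socket() -> bind() -> listen() */
    lfd = socket(AF_INET, SOCK_STREAM, 0);
    if (lfd < 0 || bind(lfd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(lfd, 1) < 0 ||
        getsockname(lfd, (struct sockaddr *)&addr, &alen) < 0)
        goto out;

    /* Client: socket() -> connect() starts the three-way handshake */
    cfd = socket(AF_INET, SOCK_STREAM, 0);
    if (cfd < 0 || connect(cfd, (struct sockaddr *)&addr, sizeof addr) < 0)
        goto out;

    /* Server: accept() returns the established connection */
    sfd = accept(lfd, NULL, NULL);
    if (sfd < 0)
        goto out;

    /* Client sends the request; server reads it and echoes it back */
    if (write(cfd, msg, strlen(msg)) < 0)
        goto out;
    n = read(sfd, buf, sizeof buf);
    if (n <= 0 || write(sfd, buf, (size_t)n) < 0) {
        n = -1;
        goto out;
    }

    /* Client reads the reply */
    n = read(cfd, reply, cap - 1);
    if (n >= 0)
        reply[n] = '\0';
out:
    if (sfd >= 0) close(sfd);
    if (cfd >= 0) close(cfd);
    if (lfd >= 0) close(lfd);
    return (int)n;
}
```

connect() can return before accept() is called because the kernel completes the handshake and parks the connection in the queue created by listen(). A real server would normally run in its own process and loop on accept().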

Page 83: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Elementary Socket: UDP Client/Server

  UDP Server
    socket()      obtain a descriptor
    bind()        assign IP & port to the socket
    recvfrom()    blocks until a datagram arrives from a client; data (request)
                  process request
    sendto()      data (reply)

  UDP Client
    socket()      obtain a descriptor
    sendto()      data (request)
    recvfrom()    data (reply)
    close()
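The UDP sequence can be sketched the same way over loopback. There is no listen(), accept(), or connect(): the server learns the client's address from recvfrom() and replies to it with sendto(). `udp_roundtrip` is a hypothetical name, and the server simply echoes the request.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Plays both UDP server and client over loopback.
 * Returns the reply length, or -1 on error. cap must be at least 1. */
int udp_roundtrip(const char *msg, char *reply, size_t cap)
{
    struct sockaddr_in srv, peer;
    socklen_t slen = sizeof srv, plen = sizeof peer;
    char buf[256];
    ssize_t n = -1;
    int sfd = -1, cfd = -1;

    memset(&srv, 0, sizeof srv);
    srv.sin_family = AF_INET;
    srv.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    srv.sin_port = 0;                    /* let the kernel pick a free port */

    /* Server: socket() -> bind() */
    sfd = socket(AF_INET, SOCK_DGRAM, 0);
    if (sfd < 0 || bind(sfd, (struct sockaddr *)&srv, sizeof srv) < 0 ||
        getsockname(sfd, (struct sockaddr *)&srv, &slen) < 0)
        goto out;

    /* Client: socket() -> sendto() carries the request */
    cfd = socket(AF_INET, SOCK_DGRAM, 0);
    if (cfd < 0 || sendto(cfd, msg, strlen(msg), 0,
                          (struct sockaddr *)&srv, sizeof srv) < 0)
        goto out;

    /* Server: recvfrom() yields the datagram and the sender's address */
    n = recvfrom(sfd, buf, sizeof buf, 0, (struct sockaddr *)&peer, &plen);
    if (n < 0 || sendto(sfd, buf, (size_t)n, 0,
                        (struct sockaddr *)&peer, plen) < 0) {
        n = -1;
        goto out;
    }

    /* Client: recvfrom() reads the reply */
    n = recvfrom(cfd, reply, cap - 1, 0, NULL, NULL);
    if (n >= 0)
        reply[n] = '\0';
out:
    if (sfd >= 0) close(sfd);
    if (cfd >= 0) close(cfd);
    return (int)n;
}
```

Over a real network either datagram could be lost, so a practical client would add a receive timeout and retry logic.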

Page 84: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Open Source Implementation 5.7: Socket Read/Write Inside Out
[Figure: call paths from user space into kernel space for a server and a client that communicate across the Internet; sys_socketcall appears on both sides]

  Server (user space: socket(), bind(), listen(), accept(), write())
    Server socket creation
      socket(): sys_socket -> sock_create -> inet_create
      bind():   sys_bind -> inet_bind
      listen(): sys_listen -> inet_listen
      accept(): sys_accept -> inet_accept -> tcp_accept -> wait_for_connection
    Send data
      write():  sys_write -> do_sock_write -> sock_sendmsg -> inet_sendmsg -> tcp_sendmsg -> tcp_write_xmit

  Client (user space: socket(), connect(), read())
    Client socket creation
      socket():  sys_socket -> sock_create -> inet_create
      connect(): sys_connect -> inet_stream_connect -> tcp_v4_getport -> tcp_v4_connect -> inet_wait_connect
    Receive data
      read():    sys_read -> do_sock_read -> sock_recvmsg -> sock_common_recvmsg -> tcp_recvmsg -> memcpy_toiovec

Page 85: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Open Source Implementation 5.7: Socket Read/Write Inside Out
[Figure: kernel data structures behind an opened Linux socket]

  struct files_struct (linux/sched.h): count, file_lock, max_fds, max_fdset, next_fd, fd[0], fd[1], ……, fd[255]
  struct file (linux/fs.h): f_list, f_dentry, f_vfsmnt, f_op, f_count, f_flags, f_mode, f_pos, ……
  struct dentry (linux/dentry.h): d_count, d_flags, d_inode, d_parent, ……
  struct inode (linux/fs.h): ……, union u (holds struct socket), ……
  struct socket: ……, inode, file, sk, ……
  struct sock (net/sock.h): s_addr, d_addr, dport, bound_dev_if, sport, ……, receive_queue, write_queue, ……, proto, ……, union tp_pinfo (struct tcp_opt: ……, snd_cwnd, ……), ……, sk_filter, ……, socket, ……

  struct proto (net/sock.h)    struct tcp_func (ipv4/tcp_ipv4.c)
    connect                      tcp_v4_connect
    close                        tcp_close
    disconnect                   tcp_disconnect
    ioctl                        tcp_ioctl
    accept                       tcp_accept
    init                         tcp_v4_init_sock
    destroy                      tcp_v4_destroy_sock
    shutdown                     tcp_shutdown
    setsockopt                   tcp_setsockopt
    getsockopt                   tcp_getsockopt
    sendmsg                      tcp_sendmsg
    recvmsg                      tcp_recvmsg
    ……                           ……

Page 86: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Performance Matters: Interrupt and Memory Copy at Socket
[Figure: Latency in transmitting TCP segments in the TCP layer]
[Figure: Latency in receiving TCP segments in the TCP layer]

Page 87: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Bridging Applications to Internetworking Protocols in Linux 2.6
  socket(domain, type, protocol)
    Parameters:
      domain: PF_PACKET (PACKET domain)
      type: SOCK_DGRAM
      protocol: 0
    Kernel functions: net/packet/af_packet.c
  Typical Applications: ping, traceroute

Page 88: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Bridging Applications to Node-to-Node Protocols in Linux 2.6
  socket(domain, type, protocol)
    Parameters:
      domain: PF_PACKET (PACKET domain)
      type: SOCK_RAW
      protocol: ETH_P_IP (Ethernet-encapsulated IP packet), or others in “/usr/include/linux/if_ether.h”
    Complete access to Ethernet header
    Kernel functions: net/packet/af_packet.c
  Typical Applications:
    Packet sniffers => performance problem!!!
    Hacking tools

Page 89: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Open Source Implementation 5.8: Bypassing the End-to-End Layer

    #include <stdio.h>
    #include <arpa/inet.h>        /* htons() */
    #include <sys/socket.h>       /* socket(), recvfrom() */
    #include <linux/if_ether.h>   /* ETH_P_ALL */

    int main() {
        int n;
        int fd;
        char buf[2048];

        /* Packet socket: delivers whole frames, Ethernet header included. */
        if ((fd = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL))) == -1) {
            printf("fail to open socket\n");
            return 1;
        }
        while (1) {
            n = recvfrom(fd, buf, sizeof(buf), 0, 0, 0);
            if (n > 0)
                printf("recv %d bytes\n", n);
        }
        return 0;
    }

Page 90: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Open Source Implementation 5.9: Making Myself Promiscuous

    strncpy(ethreq.ifr_name, "eth0", IFNAMSIZ);   /* ethreq is a struct ifreq naming the interface */
    ioctl(sock, SIOCGIFFLAGS, &ethreq);           /* read the current interface flags */
    ethreq.ifr_flags |= IFF_PROMISC;              /* add the promiscuous flag */
    ioctl(sock, SIOCSIFFLAGS, &ethreq);           /* write the flags back */

Page 91: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Packet Sniffers: Packet Capturing & Filtering
  Capture until what header?
  Towards Efficient Packet Filtering: Layered Model
    User-Space Tool: tcpdump
    User-Space Packet Filter: libpcap (portable)
    Kernel-Space Packet Filter: Linux Socket Filter

Page 92: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Open Source Implementation 5.10: Linux Socket Filter
  Linux Socket Filter (net/core/filter.c)
  Similar to BPF (Berkeley Packet Filter)
[Figure: in user space, two network monitors and rarpd; in the kernel, one filter and one buffer per user-space program, next to the protocol stack; link-level drivers below connect to the network]

Page 93: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

5.5 Transport Protocols for Streaming
  Issues
  Real-Time Transport Protocol (RTP)
  RTP Control Protocol (RTCP)
  Example: VoIP Gateway Using RTP/RTCP

Page 94: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Issue 1: Multi-homing & Multi-streaming
  Stream Control Transmission Protocol (SCTP)
    Multi-homing
      A session of SCTP can be concurrently constructed by multiple connections through different network adapters
      A heartbeat for each connection
    Multi-streaming
      Supports ordered reception for each stream
      Avoids the HOL blocking of TCP
    A 4-way handshake mechanism for security

Page 95: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Issue 2: Smooth Rate Control and TCP-friendliness
  AIMD is not suitable for streaming
  TCP-friendliness: a flow should
    respond to congestion in the transient state
    use no more bandwidth than a TCP flow in the steady state when both experience the same network conditions, such as packet loss ratio and RTT
  Datagram Congestion Control Protocol (DCCP): free selection of a congestion control scheme

Page 96: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Principle in Action: Streaming: TCP or UDP?
  Why not TCP
    loss retransmission mechanism
    continuous rate fluctuation
  Why not UDP
    too simple, dropped by network devices for security
  They are the only two mature protocols, so
    UDP is used to carry pure audio streaming, like audio and VoIP
    TCP is used for streaming with a large buffer -> delay
      OK for one-way applications, e.g. watching clips from YouTube
      Not OK for interactive applications, like video conferencing

Page 97: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Issue 3: Playback Reconstruction and Path Quality Report
  Issues: Codec Encapsulation & Path Quality Report
    Data-Plane: Video/Voice Codecs
      Video: H.263…
      Voice: G.729…
    Control-Plane: Delay/Jitter/Loss Report
  RFC Standards: RTP & RTCP
    RTP: Data-Plane, Encapsulating the Chosen Codec
    RTCP: Control-Plane, Reporting Delay/Jitter/Loss to Senders

Page 98: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

RTP (Real-Time Transport Protocol)
  Objectives
    Eliminating Packet Reorder & Loss Detection:
      Sequence #
      Timestamp
      Synchronization Source Identifier
      Contributing Source Identifier
  Header Format
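The header fields listed above sit in a 12-byte fixed header, optionally followed by a list of contributing-source identifiers. The sketch below parses that fixed header from a raw buffer, assuming the standard RTP version 2 layout; `struct rtp_header` and `rtp_parse` are names chosen for this example rather than part of any particular RTP library.

```c
#include <stddef.h>
#include <stdint.h>

/* Fields of the fixed RTP header, unpacked into host-order integers. */
struct rtp_header {
    uint8_t  version;        /* 2 bits */
    uint8_t  padding;        /* 1 bit  */
    uint8_t  extension;      /* 1 bit  */
    uint8_t  csrc_count;     /* 4 bits: number of CSRC identifiers */
    uint8_t  marker;         /* 1 bit  */
    uint8_t  payload_type;   /* 7 bits: identifies the codec */
    uint16_t seq;            /* sequence number: reorder and loss detection */
    uint32_t timestamp;      /* sampling instant: playback reconstruction */
    uint32_t ssrc;           /* synchronization source identifier */
};

static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Parse the header at p. Returns the offset of the payload
 * (12 + 4 * csrc_count), or -1 if the packet is too short or
 * is not RTP version 2. */
int rtp_parse(const uint8_t *p, size_t len, struct rtp_header *h)
{
    if (len < 12)
        return -1;
    h->version      = p[0] >> 6;
    h->padding      = (p[0] >> 5) & 1;
    h->extension    = (p[0] >> 4) & 1;
    h->csrc_count   = p[0] & 0x0F;
    h->marker       = p[1] >> 7;
    h->payload_type = p[1] & 0x7F;
    h->seq          = (uint16_t)((p[2] << 8) | p[3]);
    h->timestamp    = read_be32(p + 4);
    h->ssrc         = read_be32(p + 8);

    if (h->version != 2 || len < 12u + 4u * h->csrc_count)
        return -1;
    return 12 + 4 * h->csrc_count;
}
```

A receiver would use `seq` to detect loss and reordering, `timestamp` to schedule playback, and `ssrc` to tell apart the streams multiplexed into one session.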

Page 99: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

RTCP (RTP Control Protocol)
  Objectives
    Reporting End-to-End Delay
    Reporting Delay Jitter
    Reporting Loss Rate
  Report to sender for what?
    Switch to lower-bitrate codec
    User may get smoother real-time
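The delay jitter that RTCP reports is a running estimate kept by the receiver. The sketch below follows the interarrival jitter estimator of the RTP specification (RFC 1889, carried over to RFC 3550), J = J + (|D| - J)/16, where D is the change in transit time between consecutive packets. The state layout and function names are assumptions for this example; times are in RTP timestamp units.

```c
#include <stdint.h>

/* Receiver-side jitter state. The estimate is stored scaled by 16 so
 * the division by 16 can be done with integer arithmetic. */
struct jitter_state {
    int      have_prev;
    uint32_t prev_transit;
    uint32_t jitter_x16;
};

/* Update the estimate with one packet: arrival_ts is the local arrival
 * time and rtp_ts the timestamp carried in the RTP header. */
void jitter_update(struct jitter_state *s, uint32_t arrival_ts, uint32_t rtp_ts)
{
    uint32_t transit = arrival_ts - rtp_ts;

    if (s->have_prev) {
        int32_t d = (int32_t)(transit - s->prev_transit);
        if (d < 0)
            d = -d;
        /* J += (|D| - J) / 16, in fixed point with rounding */
        s->jitter_x16 += (uint32_t)d - ((s->jitter_x16 + 8) >> 4);
    }
    s->prev_transit = transit;
    s->have_prev = 1;
}

/* Current jitter estimate in timestamp units. */
uint32_t jitter_value(const struct jitter_state *s)
{
    return s->jitter_x16 >> 4;
}
```

The 1/16 gain smooths out single late packets, so a sender that sees the reported jitter rise steadily has a reasonable signal to switch to a lower-bitrate codec.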

Page 100: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

VoIP using RTP: Multiplexing using SSRC
  One RTP session between VoIP gateways
  Many phone calls between branch offices
  Multiplexing using different SSRC IDs within the RTP session
[Figure: phones reach the VoIP gateways through the public telephone network; the two VoIP gateways and a gatekeeper are connected by an IP cloud (Internet or private IP network)]

Page 101: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

VoIP using RTP: Codec Encapsulation
  Compress/Decompress
[Figure: Inside a VoIP Gateway Codec. Analog signal source -> Analog to Digital Converter (16 bits, 8 kHz, 128 kbps) -> Compander, A-Law or u-Law (8 bits, 8 kHz, 64 kbps) -> digital output signal at 64 kbps]
  The converter assigns 16 bits evenly distributed across x,y coordinates of the sine
  The compander compresses the data

Page 102: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

Historical Evolution: RTP Implementation Resources
  Sample Implementation in RFC 1889
    http://rfc.net/rfc1889.txt
  Vat
    http://www-nrg.ee.lbl.gov/vat/
  Rtptools
    ftp://ftp.cs.columbia.edu/pub/schulzrinne/rtptools/
  NeVoT
    http://www.cs.columbia.edu/~hgs/rtp/nevot.html
  RTP Library
    http://www.iasi.rm.cnr.it/iasi/netlab/gettingSoftware.html
    by E.A.Mastromartino; offers convenient ways to incorporate RTP functionality into C++ Internet applications.

Page 103: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

5.6 Summary (1/2)
  Three key features in process-to-process channels
    (1) port-level addressing, (2) reliable packet delivery, (3) flow rate control
    UDP: (1) only; TCP: all of them
  TCP techniques
    three-way handshake
    ack/retx, sliding-window flow control
    various versions of congestion control to retx potentially lost packets

Page 104: Chapter 5: Transport Layer1 Computer Networks An Open Source Approach Chapter 5: Transport Layer

5.6 Summary (2/2)
  Real-time transport by RTP/RTCP
    multi-streaming, multi-homing, smooth rate control, TCP-friendliness, playback reconstruction, and path quality reporting
  Socket interfaces to different layers