Implementing SCTP Auto Buffer Tuning for Linux

The problem

As of December 2010, the performance of the SCTP implementation in Linux is severely lacking. One problem is the user-configured, fixed kernel buffer size, which rarely provides a good trade-off between memory usage and speed. This document describes my work towards fixing that problem.

How does Linux allocate SCTP socket buffer space?

Each socket's buffer space does not actually exist as a single address range. Instead, a per-socket limit is placed on the aggregate size of all allocated buffers. These limits are stored in the sk_sndbuf and sk_rcvbuf fields of struct sock.
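
For orientation, here is a simplified excerpt of the relevant fields (the full definition lives in include/net/sock.h; the layout shown here is abridged):

    struct sock {
            /* ... */
            int             sk_sndbuf;      /* send buffer size limit, in bytes */
            int             sk_rcvbuf;      /* receive buffer size limit, in bytes */
            atomic_t        sk_wmem_alloc;  /* bytes currently charged to the send side */
            atomic_t        sk_rmem_alloc;  /* bytes currently charged to the receive side */
            /* ... */
    };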

The memory for each chunk is allocated by calling __alloc_skb (usually through its thin wrapper alloc_skb), which returns a struct sk_buff *. These structures are linked together to form queues.
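
A minimal sketch of that pattern (example_enqueue is a made-up name for illustration; alloc_skb and skb_queue_tail are the generic kernel APIs, not SCTP-specific):

    static int example_enqueue(struct sock *sk, unsigned int size)
    {
            struct sk_buff *skb;

            /* Allocate a buffer large enough to hold one chunk. */
            skb = alloc_skb(size, GFP_KERNEL);
            if (!skb)
                    return -ENOMEM;

            /* sk_buff structures chain into queues, such as the
             * socket's receive queue.
             */
            skb_queue_tail(&sk->sk_receive_queue, skb);
            return 0;
    }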

Buffer memory accounting is performed by sctp_set_owner_w and sctp_wfree on the send side, and by sctp_skb_set_owner_r and sctp_sock_rfree on the receive side.
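
As an illustration, the receive-side pair works roughly like this (condensed from include/net/sctp/sctp.h; the charge is undone by sctp_sock_rfree when the skb is freed):

    static inline void sctp_skb_set_owner_r(struct sk_buff *skb, struct sock *sk)
    {
            struct sctp_ulpevent *event = sctp_skb2event(skb);

            skb_orphan(skb);
            skb->sk = sk;
            skb->destructor = sctp_sock_rfree;
            atomic_add(event->rmem_len, &sk->sk_rmem_alloc);
            sk_mem_charge(sk, event->rmem_len);
    }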

When a blocking sender tries to send data on a socket that has run out of send buffer space, it needs to wait. This is accomplished by calling sctp_wait_for_sndbuf. The waiting processes are awakened by calls to sctp_wfree.
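
Condensed to its core, and assuming asoc, sk, msg_len and current_timeo are in scope, the wait loop in sctp_wait_for_sndbuf looks something like this (error handling, signal checks and timeout bookkeeping omitted):

    DEFINE_WAIT(wait);

    for (;;) {
            prepare_to_wait_exclusive(&asoc->wait, &wait, TASK_INTERRUPTIBLE);
            if (msg_len <= sctp_wspace(asoc))
                    break;          /* enough send buffer space has opened up */

            /* Release the socket lock and sleep; sctp_wfree wakes us up
             * when acknowledged chunks return buffer space.
             */
            sctp_release_sock(sk);
            current_timeo = schedule_timeout(current_timeo);
            sctp_lock_sock(sk);
    }
    finish_wait(&asoc->wait, &wait);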

How should the buffer sizes be configured?

There are already a few sysctl variables for this:

net.sctp.sctp_mem = 93504       124672  187008
net.sctp.sctp_rmem = 4096       349500  3989504
net.sctp.sctp_wmem = 4096       16384   3989504

sysctl_sctp_mem defines global limits measured in pages. sysctl_sctp_rmem and sysctl_sctp_wmem define per-socket limits measured in bytes. As of Linux 2.6.37, however, these variables are not used for anything.

How are buffers tuned in the TCP implementation?

The nine sysctl variables defined for SCTP have similarly named counterparts in the TCP implementation. Only six of these are actually used:

sysctl_tcp_rmem[0], sysctl_tcp_wmem[0] and sysctl_tcp_mem[1] are never used for any purpose in Linux 2.6.37, contrary to claims made in the documentation.

sysctl_tcp_rmem[1] and sysctl_tcp_wmem[1] are used to set the initial values of sk_rcvbuf and sk_sndbuf upon socket creation.
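
In 2.6.37 this happens in tcp_v4_init_sock (and its IPv6 counterpart):

    sk->sk_sndbuf = sysctl_tcp_wmem[1];
    sk->sk_rcvbuf = sysctl_tcp_rmem[1];

The patch below makes sctp_init_sock do the same with the SCTP sysctls.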

sysctl_tcp_wmem[2] is used in two places:

  1. When a connection is established, the send buffer is increased to fit three maximum-sized segments plus some overhead. This value is clamped to sysctl_tcp_wmem[2].
  2. After an incoming ACK has caused the send queue to shrink, tcp_should_expand_sndbuf is called to determine whether the send buffer should be increased. If so, it is adjusted to twice the size of the congestion window, clamped to sysctl_tcp_wmem[2].

tcp_should_expand_sndbuf checks for a few conditions, all of which must be false for the buffer to grow (these are the same checks mirrored by sctp_raise_sndbuf in the patch below):

  1. The user has fixed the buffer size via SO_SNDBUF, so SOCK_SNDBUF_LOCK is set.
  2. TCP is under global memory pressure.
  3. Global TCP memory usage has reached sysctl_tcp_mem[0].
  4. The congestion window is already full, i.e. the number of packets in flight is at least snd_cwnd.

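For reference, the check in net/ipv4/tcp_input.c reads approximately as follows (a close paraphrase of the 2.6.37 source, not a verbatim quote):

    static int tcp_should_expand_sndbuf(struct sock *sk)
    {
            struct tcp_sock *tp = tcp_sk(sk);

            /* The user has fixed the send buffer size with SO_SNDBUF. */
            if (sk->sk_userlocks & SOCK_SNDBUF_LOCK)
                    return 0;

            /* TCP is under global memory pressure. */
            if (tcp_memory_pressure)
                    return 0;

            /* Global TCP memory usage has reached the lower bound. */
            if (atomic_long_read(&tcp_memory_allocated) >= sysctl_tcp_mem[0])
                    return 0;

            /* The congestion window is fully utilized. */
            if (tp->packets_out >= tp->snd_cwnd)
                    return 0;

            return 1;
    }
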
After initialization, sk_rcvbuf is adjusted in three places.

tcp_rcv_space_adjust calculates twice the amount of data passed to user space over the course of one round-trip time. If this value is greater than the current receive buffer size, the buffer is increased up to that value.
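
Heavily condensed, and with a crude stand-in for the real per-skb overhead estimate, the logic in tcp_rcv_space_adjust (net/ipv4/tcp_input.c) amounts to:

    static void tcp_rcv_space_adjust(struct sock *sk)
    {
            struct tcp_sock *tp = tcp_sk(sk);
            int time, space, rcvmem;

            time = tcp_time_stamp - tp->rcvq_space.time;
            if (time < (tp->rcv_rtt_est.rtt >> 3) || tp->rcv_rtt_est.rtt == 0)
                    return;         /* less than one RTT measured so far */

            /* Twice the data copied to user space during the last RTT. */
            space = 2 * (tp->copied_seq - tp->rcvq_space.seq);

            if (space > tp->rcvq_space.space &&
                !(sk->sk_userlocks & SOCK_RCVBUF_LOCK)) {
                    tp->rcvq_space.space = space;
                    rcvmem = space;  /* the real code adds skb overhead here */
                    if (rcvmem > sk->sk_rcvbuf)
                            sk->sk_rcvbuf = min(rcvmem, sysctl_tcp_rmem[2]);
            }

            /* Start a new measurement period. */
            tp->rcvq_space.seq = tp->copied_seq;
            tp->rcvq_space.time = tcp_time_stamp;
    }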

sysctl_tcp_mem[2] is used in the function tcp_too_many_orphans.

tcp_too_many_orphans determines whether the number of orphaned sockets exceeds a system-wide limit, or whether global TCP memory usage exceeds the limit defined by sysctl_tcp_mem[2]. It is called in a couple of places to decide whether a socket's resources should be freed early in the tear-down phase.
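
Slightly simplified, it looks like this in 2.6.37 (include/net/tcp.h):

    static inline bool tcp_too_many_orphans(struct sock *sk, int shift)
    {
            struct percpu_counter *ocp = sk->sk_prot->orphan_count;
            int orphans = percpu_counter_read_positive(ocp);

            /* Too many orphaned sockets system-wide? */
            if (orphans << shift > sysctl_tcp_max_orphans) {
                    orphans = percpu_counter_sum_positive(ocp);
                    if (orphans << shift > sysctl_tcp_max_orphans)
                            return true;
            }

            /* Global TCP memory usage past the hard limit? */
            if (sk->sk_wmem_queued > SOCK_MIN_SNDBUF &&
                atomic_long_read(&tcp_memory_allocated) > sysctl_tcp_mem[2])
                    return true;

            return false;
    }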

Unresolved issues

Unlike TCP, SCTP limits the largest message that can be passed to sendmsg to the size of the send buffer. Thus, a user who sets the buffer size too small may break some applications. For example, the lksctp-tools unit tests expect support for 32768-byte messages, while the default send buffer size is 16384 bytes.
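
The limit is easy to demonstrate from user space. A minimal sketch (the port number is hypothetical, and it assumes that sctp_sendmsg rejects oversized messages with EMSGSIZE before queuing any data, as 2.6.37 does):

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <arpa/inet.h>
    #include <netinet/in.h>
    #include <sys/socket.h>

    int main(void)
    {
            int fd = socket(AF_INET, SOCK_SEQPACKET, IPPROTO_SCTP);
            int sndbuf = 8192;              /* the kernel doubles this value */
            char *msg = calloc(1, 65536);
            struct sockaddr_in peer = {
                    .sin_family = AF_INET,
                    .sin_port   = htons(9999),      /* hypothetical port */
            };

            peer.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
            setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf));

            /* 65536 bytes exceeds the (doubled) 16384-byte send buffer,
             * so this sendto is expected to fail with EMSGSIZE.
             */
            if (sendto(fd, msg, 65536, 0,
                       (struct sockaddr *)&peer, sizeof(peer)) < 0)
                    perror("sendto");

            free(msg);
            return 0;
    }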

One problem with the algorithm for determining the receive buffer size is that it requires an RTT estimate, and obtaining that estimate requires at least one outbound data packet. For connections with unidirectional data flow, like the data connections used in FTP (IIRC), the receiving end never sends any data back and thus never gets an RTT estimate. The receive buffer size will consequently remain constant.

Performance

Setup

Computer X has two ethernet interfaces. One interface is connected to Computer A, while the other interface is connected to a switch. Computer B is connected to the same switch. The tests are performed between Computer A and Computer B.

Data is sent from Computer A to Computer B. Computer B echoes all data it receives back to the originating host. The amount of data received back by Computer A is used for calculating the transfer rate.

The kernel buffer sizes are polled once for every 4096 bytes sent.
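
The polling presumably boils down to reading the sizes back with getsockopt, along these lines (a sketch of the idea, not the actual benchmark code; fd is the test socket):

    int sndbuf, rcvbuf;
    socklen_t optlen = sizeof(int);

    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, &optlen);
    optlen = sizeof(int);
    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, &optlen);
    printf("%d %d\n", sndbuf, rcvbuf);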

All tests ran for 10 seconds. Delays were added on Computer X using /sbin/tc qdisc add dev eth1 root netem delay 10ms (and likewise for 100ms).

Transfer rates

SCTP with auto buffer tuning starts out with a 16384-byte send buffer, while SCTP from Linux 2.6.37 uses a constant 122880-byte buffer. This explains why auto buffer tuning sometimes appears to perform worse. The proper solution is to fix whatever is causing the fluctuations in the adjustment of the congestion and receive windows; a short-term fix is to set the default send buffer size to 122880 bytes.

Buffer size development

These are the buffer sizes as they develop on Computer B during the 10-second tests with 0 ms, 10 ms and 100 ms frame delay. Where the lines end, the buffer sizes have become stable.

[Graphs omitted: buffer size in bytes (Y) over time in seconds (X); one plot each for 0 ms, 10 ms and 100 ms delay.]

Interpretation of the Results

Disclaimer: This is my first time working on protocols below the transport layer, and I'm not entirely sure whether what I'm writing makes sense, or whether it is blindingly obvious.

With this patch, transfer rates are much less affected by network latency. However, according to some quick calculations, an ideal slow-starting protocol running under conditions similar to these tests should be able to transfer more than 40 MB/s during the first 10 seconds at 40 ms RTT. Neither SCTP nor TCP comes close to this. Also, SCTP performs much worse than TCP at lower latencies, though the difference decreases sharply as latency increases.

I suspect there are race conditions that cause both SCTP and TCP to leave the exponential phase early, because reports of increased receiver windows do not arrive early enough. A possible fix would be to resume exponential growth when the receive window increases and there has been no packet loss. If this is in fact what is happening, it could explain the occasional extremely poor performance: if the exponential phase is left during the first dozen packets or so, it can take a long time to reach 100 MB/s.

As for the difference between TCP and SCTP at 0 ms latency, I'm inclined to attribute it to the branch-heavy nature of the Linux SCTP code. There are many function calls and linked-list operations involved in getting a single SCTP packet into the network interface buffer. It seems to me that the code is designed for beauty of structure rather than for minimal CPU cache misses and branching. The profile of the SCTP code (as gathered by OProfile) is rather flat, meaning that the heavy work is not concentrated in any one function.

The Patch

diff --git a/include/net/sctp/sctp.h b/include/net/sctp/sctp.h
index 505845d..e06d757 100644
--- a/include/net/sctp/sctp.h
+++ b/include/net/sctp/sctp.h
@@ -128,6 +128,10 @@ extern int sctp_register_pf(struct sctp_pf *, sa_family_t);
int sctp_backlog_rcv(struct sock *sk, struct sk_buff *skb);
int sctp_inet_listen(struct socket *sock, int backlog);
void sctp_write_space(struct sock *sk);
+void sctp_raise_sndbuf(struct sctp_transport *transport);
+void sctp_raise_rcvbuf(struct sctp_association *asoc, int rcvmem);
+unsigned int sctp_rcvbuf_adjust(struct sctp_association *asoc,
+ unsigned int len);
void sctp_data_ready(struct sock *sk, int len);
unsigned int sctp_poll(struct file *file, struct socket *sock,
poll_table *wait);
diff --git a/include/net/sctp/structs.h b/include/net/sctp/structs.h
index 69fef4f..fa160bb 100644
--- a/include/net/sctp/structs.h
+++ b/include/net/sctp/structs.h
@@ -274,6 +274,10 @@ extern struct sctp_globals {
#define sctp_checksum_disable (sctp_globals.checksum_disable)
#define sctp_rwnd_upd_shift (sctp_globals.rwnd_update_shift)
These variables are now needed in several translation units, so I thought it was best to place them next to the global struct in <net/sctp/structs.h>.
+extern long sysctl_sctp_mem[3];
+extern int sysctl_sctp_rmem[3];
+extern int sysctl_sctp_wmem[3];
+
/* SCTP Socket type: UDP or TCP style. */
typedef enum {
SCTP_SOCKET_UDP = 0,
@@ -1754,6 +1758,17 @@ struct sctp_association {
*/
__u32 rwnd_press;
The corresponding variables in the TCP implementation are also kept in a struct called rcvq_space. I don't think this is the best name, but in this case I value consistency over clarity.
+ /* Timer used for calculating the appropriate receive buffer space. */
+ struct {
+ /* The starting time of the current measurement period. */
+ unsigned long start_time;
+
+ /* The number of bytes delivered to user space in the current
+ * measurement period.
+ */
+ int delivered;
+ } rcvq_space;
+
/* This is the sndbuf size in use for the association.
* This corresponds to the sndbuf size for the association,
* as specified in the sk->sndbuf.
diff --git a/net/sctp/associola.c b/net/sctp/associola.c
index 5f1fb8b..7ad88f5 100644
--- a/net/sctp/associola.c
+++ b/net/sctp/associola.c
@@ -1462,7 +1462,9 @@ void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned len)
/* Decrease asoc's rwnd by len. */
void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned len)
{
The socket is now referenced so often in this function that I decided to put it into a separate variable.
+ struct sock *sk = asoc->base.sk;
int rx_count;
The desired value of sk->sk_rcvbuf is called rcvmem in the TCP implementation. We do the same here.
+ int rcvmem;
int over = 0;
SCTP_ASSERT(asoc->rwnd, "rwnd zero", return);
@@ -1471,14 +1473,19 @@ void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned len)
if (asoc->ep->rcvbuf_policy)
rx_count = atomic_read(&asoc->rmem_alloc);
else
- rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
+ rx_count = atomic_read(&sk->sk_rmem_alloc);
+
When we receive some data that would overflow the receive buffer, we try to increase the size of the receive buffer before resorting to choking the sender.
+ if (rx_count >= sk->sk_rcvbuf) {
+ rcvmem = min_t(int, rx_count + len, sysctl_sctp_rmem[2]);
+ sctp_raise_rcvbuf(asoc, rcvmem);
+ }
/* If we've reached or overflowed our receive buffer, announce
* a 0 rwnd if rwnd would still be positive. Store the
* the pottential pressure overflow so that the window can be restored
* back to original value.
*/
- if (rx_count >= asoc->base.sk->sk_rcvbuf)
+ if (rx_count >= sk->sk_rcvbuf)
over = 1;
if (asoc->rwnd >= len) {
diff --git a/net/sctp/sm_make_chunk.c b/net/sctp/sm_make_chunk.c
index 2cc46f0..9c0629a 100644
--- a/net/sctp/sm_make_chunk.c
+++ b/net/sctp/sm_make_chunk.c
@@ -2352,7 +2352,7 @@ int sctp_process_init(struct sctp_association *asoc, sctp_cid_t cid,
*/
list_for_each_entry(transport, &asoc->peer.transport_addr_list,
transports) {
The peer's initial advertised receive window is rarely an indication of the transport's bandwidth-delay product. The send buffer is typically set to twice the size of the congestion window, so I decided to set the slow-start threshold to half the maximum size of the send buffer. The SCTP RFC allows any value here. This change is made in two places.
- transport->ssthresh = asoc->peer.i.a_rwnd;
+ transport->ssthresh = sysctl_sctp_wmem[2] / 2;
}
/* Set up the TSN tracking pieces. */
diff --git a/net/sctp/socket.c b/net/sctp/socket.c
index fff0926..b79d670 100644
--- a/net/sctp/socket.c
+++ b/net/sctp/socket.c
@@ -1946,6 +1946,7 @@ SCTP_STATIC int sctp_recvmsg(struct kiocb *iocb, struct sock *sk,
struct sctp_ulpevent *event = NULL;
struct sctp_sock *sp = sctp_sk(sk);
struct sk_buff *skb;
+ int rcvbuf_increment;
int copied;
int err = 0;
int skb_len;
@@ -2016,8 +2017,10 @@ SCTP_STATIC int sctp_recvmsg(struct kiocb *iocb, struct sock *sk,
* rwnd by that amount. If all the data in the skb is read,
* rwnd is updated when the event is freed.
*/
When data is delivered to user space, report the amount to sctp_rcvbuf_adjust. If the receive buffer size is updated, the difference is returned and stored in rcvbuf_increment. This amount is then added to the rwnd increment.
- if (!sctp_ulpevent_is_notification(event))
- sctp_assoc_rwnd_increase(event->asoc, copied);
+ if (!sctp_ulpevent_is_notification(event)) {
+ rcvbuf_increment = sctp_rcvbuf_adjust(event->asoc, len);
+ sctp_assoc_rwnd_increase(event->asoc, copied + rcvbuf_increment);
+ }
goto out;
} else if ((event->msg_flags & MSG_NOTIFICATION) ||
(event->msg_flags & MSG_EOR))
@@ -3769,6 +3772,9 @@ SCTP_STATIC int sctp_init_sock(struct sock *sk)
SCTP_DBG_OBJCNT_INC(sock);
This is where we initialize the socket buffer sizes to the configured defaults.
+ sk->sk_sndbuf = sysctl_sctp_wmem[1];
+ sk->sk_rcvbuf = sysctl_sctp_rmem[1];
+
local_bh_disable();
percpu_counter_inc(&sctp_sockets_allocated);
sock_prot_inuse_add(sock_net(sk), sk->sk_prot, 1);
@@ -6253,6 +6259,83 @@ void sctp_write_space(struct sock *sk)
}
}
This function raises the socket send buffer size to twice the size of the congestion window, subject to certain limitations. These are the same limitations used in the TCP implementation.
+void sctp_raise_sndbuf(struct sctp_transport *transport)
+{
+ struct sctp_association *asoc = transport->asoc;
+ struct sock *sk = asoc->base.sk;
+ int sndmem;
+
+ sndmem = min_t(int, transport->cwnd * 2, sysctl_sctp_wmem[2]);
+
+ if (sk->sk_sndbuf >= sndmem)
+ return;
+
+ if (sk->sk_userlocks & SOCK_SNDBUF_LOCK)
+ return;
+
+ if (sctp_memory_pressure)
+ return;
+
+ if (atomic_long_read(&sctp_memory_allocated) >= sysctl_sctp_mem[0])
+ return;
+
+ if (transport->flight_size >= transport->cwnd)
+ return;
+
+ sk->sk_sndbuf = sndmem;
+}
+
This function raises the socket receive buffer size to a specified value, subject to limitations similar to those in the send buffer code.
+void sctp_raise_rcvbuf(struct sctp_association *asoc, int rcvmem)
+{
+ struct sock *sk = asoc->base.sk;
+
+ if (sk->sk_rcvbuf >= rcvmem)
+ return;
+
+ if (sk->sk_userlocks & SOCK_RCVBUF_LOCK)
+ return;
+
+ if (sctp_memory_pressure)
+ return;
+
+ if (atomic_long_read(&sctp_memory_allocated) >= sysctl_sctp_mem[0])
+ return;
+
+ sk->sk_rcvbuf = rcvmem;
+}
+
This function measures how much data is copied into user space over the course of one RTT, times 2. If the receive buffer is smaller than this amount, it is increased.
+unsigned int sctp_rcvbuf_adjust(struct sctp_association *asoc, unsigned int len)
+{
+ struct sctp_transport *transport = asoc->peer.last_data_from;
+ struct sock *sk = asoc->base.sk;
+ unsigned long time;
+ unsigned increment = 0;
+ int rcvmem = 0;
+
+ if (!asoc->rcvq_space.start_time)
+ goto new_measure;
+
+ asoc->rcvq_space.delivered += len;
+
+ time = jiffies - asoc->rcvq_space.start_time;
+
+ if (time < transport->rtt || !transport->rtt)
+ return 0;
+
+ rcvmem = min_t(int, asoc->rcvq_space.delivered * 2, sysctl_sctp_rmem[2]);
+
+ if (sk->sk_rcvbuf < rcvmem) {
+ increment = rcvmem - sk->sk_rcvbuf;
+ sctp_raise_rcvbuf(asoc, rcvmem);
+ }
+
+new_measure:
+ asoc->rcvq_space.start_time = jiffies;
+ asoc->rcvq_space.delivered = 0;
+
+ return increment;
+}
+
/* Is there any sndbuf space available on the socket?
*
* Note that sk_wmem_alloc is the sum of the send buffers on all of the
diff --git a/net/sctp/transport.c b/net/sctp/transport.c
index d3ae493..4c1768e 100644
--- a/net/sctp/transport.c
+++ b/net/sctp/transport.c
@@ -466,6 +466,8 @@ void sctp_transport_raise_cwnd(struct sctp_transport *transport,
transport->cwnd = cwnd;
transport->partial_bytes_acked = pba;
+
cwnd was (possibly) just updated, so we need to call sctp_raise_sndbuf to maintain the send buffer at twice the size of the congestion window.
+ sctp_raise_sndbuf(transport);
}
/* This routine is used to lower the transport's cwnd when congestion is
@@ -621,7 +623,7 @@ void sctp_transport_reset(struct sctp_transport *t)
*/
t->cwnd = min(4*asoc->pathmtu, max_t(__u32, 2*asoc->pathmtu, 4380));
t->burst_limited = 0;
This is the same change to the initial ssthresh as described above.
- t->ssthresh = asoc->peer.i.a_rwnd;
+ t->ssthresh = sysctl_sctp_wmem[2] / 2;
t->rto = asoc->rto_initial;
t->rtt = 0;
t->srtt = 0;
diff --git a/net/sctp/ulpevent.c b/net/sctp/ulpevent.c
index aa72e89..4395e52b 100644
--- a/net/sctp/ulpevent.c
+++ b/net/sctp/ulpevent.c
@@ -987,7 +987,7 @@ static void sctp_ulpevent_receive_data(struct sctp_ulpevent *event,
static void sctp_ulpevent_release_data(struct sctp_ulpevent *event)
{
struct sk_buff *skb, *frag;
- unsigned int len;
+ unsigned int len, rcvbuf_increment;
/* Current stack structures assume that the rcv buffer is
* per socket. For UDP style sockets this is not true as
@@ -1012,7 +1012,8 @@ static void sctp_ulpevent_release_data(struct sctp_ulpevent *event)
}
done:
This is the second place where we measure the amount of data passed to user space. See above for details.
- sctp_assoc_rwnd_increase(event->asoc, len);
+ rcvbuf_increment = sctp_rcvbuf_adjust(event->asoc, len);
+ sctp_assoc_rwnd_increase(event->asoc, len + rcvbuf_increment);
sctp_ulpevent_release_owner(event);
}

Links

TCP memory documentation, created by Ian McDonald when trying to implement memory management for Net:DCCP.