As of December 2010, the performance of the SCTP implementation in Linux is severely lacking. One problem is the fixed, user-configured kernel buffer size, which rarely provides a good trade-off between memory and speed. This document describes my work towards fixing that problem.
Each socket's buffer space does not actually exist as a single address
range. Instead, a limit is set per socket on the aggregate size of all
allocated buffers. These variables are called sk_sndbuf and sk_rcvbuf.
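As a rough userspace sketch (my own model, not kernel code; the field names merely mirror the kernel's sk_sndbuf and sk_wmem_queued), the aggregate accounting works like this:

```c
#include <stddef.h>

/* Userspace model: a socket does not own one contiguous buffer; it
 * tracks the aggregate size of its allocated buffers against a
 * per-socket limit. */
struct sock_model {
    size_t sk_sndbuf;      /* configured limit, in bytes */
    size_t sk_wmem_queued; /* aggregate size of queued send buffers */
};

/* Returns 1 if a chunk of the given size may be allocated. */
static int sndbuf_has_room(const struct sock_model *sk, size_t size)
{
    return sk->sk_wmem_queued + size <= sk->sk_sndbuf;
}

/* Charge a new buffer against the limit; returns 0 if the caller
 * must wait for space instead. */
static int charge_sndbuf(struct sock_model *sk, size_t size)
{
    if (!sndbuf_has_room(sk, size))
        return 0;
    sk->sk_wmem_queued += size;
    return 1;
}

/* Release the charge when the buffer is freed. */
static void uncharge_sndbuf(struct sock_model *sk, size_t size)
{
    sk->sk_wmem_queued -= size;
}
```
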
The memory for each chunk is allocated by calling
alloc_skb, which returns a
struct sk_buff *. These structures are used to form the socket's queues.
Buffer memory accounting is performed by charging each buffer's size
against the per-socket limit when it is allocated, and releasing the
charge when it is freed.
When a blocking sender tries to send data on a socket that has run out
of send buffer space, it needs to wait. This is accomplished by calling
sctp_wait_for_sndbuf. The waiting processes are awakened when buffer
space is freed again.
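The wait/wakeup pattern can be sketched in userspace with a condition variable (an illustration of the idea only, not the kernel's wait-queue code; all names here are my own):

```c
#include <pthread.h>
#include <stddef.h>

/* Model of the sctp_wait_for_sndbuf pattern: a blocked sender sleeps
 * until freed buffer space is signalled. */
struct sndbuf_waiter {
    pthread_mutex_t lock;
    pthread_cond_t  write_space;
    size_t limit;   /* send buffer limit, bytes */
    size_t queued;  /* currently charged bytes */
};

static void sndbuf_waiter_init(struct sndbuf_waiter *w, size_t limit)
{
    pthread_mutex_init(&w->lock, NULL);
    pthread_cond_init(&w->write_space, NULL);
    w->limit = limit;
    w->queued = 0;
}

/* Block until the requested amount fits, then charge it. */
static void sndbuf_wait_and_charge(struct sndbuf_waiter *w, size_t need)
{
    pthread_mutex_lock(&w->lock);
    while (w->queued + need > w->limit)
        pthread_cond_wait(&w->write_space, &w->lock); /* sleep: no room */
    w->queued += need;
    pthread_mutex_unlock(&w->lock);
}

/* Free buffer space and wake any blocked senders. */
static void sndbuf_release(struct sndbuf_waiter *w, size_t freed)
{
    pthread_mutex_lock(&w->lock);
    w->queued -= freed;
    pthread_cond_broadcast(&w->write_space);
    pthread_mutex_unlock(&w->lock);
}
```
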
There already are a few sysctl variables for this:
net.sctp.sctp_mem = 93504 124672 187008
net.sctp.sctp_rmem = 4096 349500 3989504
net.sctp.sctp_wmem = 4096 16384 3989504
sysctl_sctp_mem defines global limits measured in pages.
sysctl_sctp_rmem and sysctl_sctp_wmem define per-socket limits measured
in bytes. As of Linux 2.6.37, these variables are not used for anything.
The nine sysctl variables defined for SCTP have similarly named counterparts in the TCP implementation. Only six of these are actually used; the other three are never used for any purpose in Linux 2.6.37, contrary to claims made in the TCP memory documentation. The default values of sysctl_tcp_rmem and sysctl_tcp_wmem are used to set the initial values of sk_rcvbuf and sk_sndbuf upon socket creation.
sysctl_tcp_wmem is used in two places. Besides providing the initial
send buffer size, it is consulted when tcp_should_expand_sndbuf is
called to determine if the send buffer should be increased. If so, the
buffer is adjusted to twice the size of the congestion window, clamped
to sysctl_tcp_wmem[2]. tcp_should_expand_sndbuf checks for a few
conditions, all of which must be false: the user has fixed the buffer
size, the system is under memory pressure, or global TCP memory usage
is already past its limit.
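The growth policy can be condensed into a standalone model (the field names mirror the kernel, but this is a sketch of the logic described above, not kernel code):

```c
#include <stddef.h>

struct tcp_sock_model {
    int    sndbuf_locked;     /* user fixed the size via SO_SNDBUF */
    int    memory_pressure;   /* global TCP memory pressure flag */
    int    over_global_limit; /* global allocation past sysctl_tcp_mem */
    size_t sk_sndbuf;         /* current send buffer limit, bytes */
    size_t cwnd_bytes;        /* congestion window, bytes */
};

/* All conditions must be false for the buffer to grow. */
static int should_expand_sndbuf(const struct tcp_sock_model *tp)
{
    return !tp->sndbuf_locked && !tp->memory_pressure &&
           !tp->over_global_limit;
}

/* Grow sk_sndbuf to twice the congestion window, clamped to wmem_max
 * (the analogue of sysctl_tcp_wmem[2]); never shrink it. */
static size_t expand_sndbuf(struct tcp_sock_model *tp, size_t wmem_max)
{
    if (should_expand_sndbuf(tp)) {
        size_t want = 2 * tp->cwnd_bytes;

        if (want > wmem_max)
            want = wmem_max;
        if (want > tp->sk_sndbuf)
            tp->sk_sndbuf = want;
    }
    return tp->sk_sndbuf;
}
```
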
sk_rcvbuf is adjusted in three places. In one of them,
tcp_rcv_space_adjust is called to determine the ideal receive buffer
size. tcp_rcv_space_adjust calculates the amount of data passed into
user space over the course of one round-trip, times two. The receive
buffer size is increased up to this value, if it is greater than the
current size, clamped to sysctl_tcp_rmem[2].
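Condensed into a simplified model (not the kernel function itself), the sizing rule is:

```c
#include <stddef.h>

/* Grow the receive buffer toward twice the data consumed by user
 * space in one RTT, clamped to rmem_max (the analogue of
 * sysctl_tcp_rmem[2]). Returns the new buffer size. */
static size_t rcv_space_adjust(size_t sk_rcvbuf, size_t copied_per_rtt,
                               size_t rmem_max)
{
    size_t want = 2 * copied_per_rtt;

    if (want > rmem_max)
        want = rmem_max;
    return want > sk_rcvbuf ? want : sk_rcvbuf; /* never shrinks */
}
```
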
sysctl_tcp_mem is used in the function tcp_too_many_orphans, which
determines whether the number of orphaned sockets is past a system-wide
limit, or the global TCP memory usage is past the limit defined by
sysctl_tcp_mem[2]. This function is called in a couple of places to
determine whether a socket's resources should be freed early in the
tear-down phase.
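The check amounts to a simple disjunction; as a sketch (parameter names are my own, with mem_hard_limit playing the role of sysctl_tcp_mem[2]):

```c
/* Either trigger alone is enough to free the socket's resources
 * early in the tear-down phase. */
static int too_many_orphans(long orphans, long orphan_limit,
                            long pages_allocated, long mem_hard_limit)
{
    return orphans > orphan_limit || pages_allocated > mem_hard_limit;
}
```
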
Unlike for TCP, the size of the SCTP send buffer decides the largest
message one can pass to
sendmsg. Thus, a user may break some
applications by setting the buffer size too small. For example, the
lksctp-tools unit tests expect support for 32768 byte messages, while
the default send buffer size is 16384 bytes.
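As far as I can tell, the kernel rejects such oversized messages outright with EMSGSIZE. A sketch of the constraint (my own helper, not the kernel's code path):

```c
#include <errno.h>
#include <stddef.h>

/* With a fixed send buffer, a message larger than sk_sndbuf can never
 * fit, so sendmsg fails immediately instead of blocking. */
static int check_msg_size(size_t msg_len, size_t sk_sndbuf)
{
    return msg_len > sk_sndbuf ? -EMSGSIZE : 0;
}
```
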
One problem with the algorithm for determining the receive buffer size is that it requires an RTT estimate. This estimate requires at least one outbound data packet. For connections with unidirectional data flow, like the data connections used in FTP (IIRC), the receiving end does not send any data back and thus never gets any RTT estimate. The receive buffer size will consequently remain constant.
Computer X has two ethernet interfaces. One interface is connected to Computer A, while the other interface is connected to a switch. Computer B is connected to the same switch. The tests are performed between Computer A and Computer B.
Data is sent from Computer A to Computer B. Computer B echoes all data it receives back to the originating host. The amount of data received back by Computer A is used for calculating the transfer rate.
The kernel buffer sizes are polled for every 4096 bytes sent.
All tests ran for 10 seconds. Delays were added using
tc qdisc add dev eth1 root netem delay 10ms and
tc qdisc add dev eth1 root netem delay 100ms.
SCTP with auto buffer tuning starts out with a 16384 byte send buffer, while SCTP from Linux 2.6.37 uses a constant 122880 byte buffer. This explains why auto buffer tuning sometimes appears to perform worse. The real solution is to fix whatever is causing fluctuations in the adjustment of congestion and receive windows; a short-term fix is setting the default buffer size to 122880 bytes.
These are the buffer sizes as they develop on Computer B during the 10 second tests for 0ms, 10ms and 100ms frame delay. Where the lines end, the buffer sizes become stable.
[Plots: buffer size in bytes (Y) versus time in seconds (X), one each for 0 ms, 10 ms and 100 ms delay.]
Disclaimer: This is my first time working on protocols below the transport layer, and I'm not entirely sure that what I'm writing makes any sense or is not blindingly obvious.
With this patch, transfer rates are much less affected by network latency. However, according to some quick calculations, an ideal slow-starting protocol running under conditions similar to these tests should be able to transfer more than 40 MB/s during the first 10 seconds at 40 ms RTT. Neither SCTP nor TCP comes close to this. Also, SCTP performs much worse than TCP at lower latencies, though the difference decreases sharply as latency increases.
I suspect there are race conditions that cause both SCTP and TCP to leave the exponential phase early, because reports of increased receiver windows do not arrive early enough. A possible fix would be to resume exponential growth when the receive window increases and there has been no packet loss. If this is in fact what is happening, it could explain the occasional extremely poor performance: if the exponential phase is left during the first dozen packets or so, it can take a long time to reach 100 MB/s.
As for the difference between TCP and SCTP at 0ms latency, I'm inclined to attribute this to the branch-heavy nature of the Linux SCTP code. There are many function calls and linked list operations involved in getting a single SCTP packet into the network interface buffer. It seems to me like the code is designed for beauty of structure rather than minimal CPU cache misses and branching. The profile of the SCTP code (as gathered by OProfile) is rather flat, meaning that the heavy work is not condensed into one function.
|These variables are now needed in several translation units, so I thought it was best to place them next to the global struct in <net/sctp/structs.h>|
|The corresponding variables in the TCP implementation are also kept in a struct called rcvq_space. I don't think this is the best name, but in this case I value consistency over clarity.|
|The socket is now referenced so often in this function that I decided to put it into a separate variable.|
|The desired value of sk->sk_rcvbuf is called rcvmem in the TCP implementation. We do the same here.|
|When we receive some data that would overflow the receive buffer, we try to increase the size of the receive buffer before resorting to choking the sender.|
|The peer's initial advertised receive window is rarely an indication of the transport's bandwidth-delay product. The send buffer is typically set to twice the size of the congestion window, so I decided to set the slow-start threshold to half the maximum size of the send buffer. The SCTP RFC allows any value here. This change is made in two places.|
|When data is delivered to user space, report the amount to sctp_rcvbuf_adjust. If the receive buffer size is updated, the difference is returned and stored in rcvbuf_increment. This amount is then added to the rwnd increment.|
|This is where we initialize the socket buffer sizes to the configured defaults.|
|This function raises the socket send buffer size to twice the size of the congestion window, subject to certain limitations. These are the same limitations used in the TCP implementation.|
|This function raises the socket receive buffer size to a specified value, subject to limitations similar to those in the send buffer code.|
|This function measures how much data is copied into user space over the course of one RTT, times 2. If the receive buffer is smaller than this amount, it is increased.|
|cwnd was (possibly) just updated, so we need to call sctp_raise_sndbuf to maintain the send buffer at twice the size of the congestion window.|
|This is the same change to the initial ssthresh as described above.|
|This is the second place where we measure the amount of data passed to user space. See above for details.|
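The rcvbuf_increment mechanism from the notes above can be sketched as follows (the names follow the notes; the body is illustrative, not the actual patch):

```c
#include <stddef.h>

/* If the measured demand (data copied to user space per RTT, times
 * two) grows the receive buffer, return the difference so it can be
 * folded into the rwnd update; otherwise return 0. */
static size_t rcvbuf_adjust(size_t *sk_rcvbuf, size_t copied_per_rtt,
                            size_t rmem_max)
{
    size_t want = 2 * copied_per_rtt;
    size_t old = *sk_rcvbuf;

    if (want > rmem_max)
        want = rmem_max;
    if (want <= old)
        return 0;            /* no change, nothing to add to rwnd */
    *sk_rcvbuf = want;
    return want - old;       /* the rcvbuf_increment for rwnd */
}
```
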
TCP memory documentation, created by Ian McDonald when trying to implement memory management for Net:DCCP.