| field | value | date |
|---|---|---|
| author | Eric Dumazet <eric.dumazet@gmail.com> | 2012-04-24 22:12:06 -0400 |
| committer | Willy Tarreau <w@1wt.eu> | 2013-06-10 11:43:32 +0200 |
| commit | 2ed8840ba140a6c4c50f32521a3f551be20d9883 | |
| tree | 889630e3702e88f89c608f883fbc1791f2fc3e52 /net | |
| parent | b8710128e201bc0d628d9644454585925a6e59fd | |
tcp: allow splice() to build full TSO packets
[ This combines upstream commit
2f53384424251c06038ae612e56231b96ab610ee and the follow-on bug fix
commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2 ]
vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a
time, adding at most 4096 bytes to an skb (assuming PAGE_SIZE=4096).
The call to tcp_push() at the end of do_tcp_sendpages() forces an
immediate xmit when the pipe is not already filled, and tso_fragment()
tries to split these skbs at MSS multiples.
4096 bytes are usually split into an skb of 2 MSS and a remaining
sub-MSS skb (assuming MTU=1500).
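The split described above is plain integer arithmetic. Here is a minimal sketch, assuming an IPv4 MSS of 1460 bytes (a 1500-byte MTU minus 20-byte IP and 20-byte TCP headers, no TCP options). The helper and struct names are hypothetical. The sketch reproduces only the numbers, not tso_fragment() itself:

```c
#include <stddef.h>

/* Hypothetical helper (not a kernel function): the arithmetic of cutting a
 * pushed skb of `len` bytes at an MSS multiple, as described above. */
struct mss_split {
	size_t full_segs;  /* MSS-sized segments kept together in the TSO skb */
	size_t tail_bytes; /* bytes left over in the trailing sub-MSS skb */
};

static struct mss_split split_at_mss(size_t len, size_t mss)
{
	struct mss_split s = { len / mss, len % mss };

	return s;
}
```

With mss = 1460, a 4096-byte page gives 2 full segments (2920 bytes) and a 1176-byte tail. With TCP timestamps enabled the MSS is typically 1448, which gives 2 segments and a 1200-byte tail.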
This makes slow start suboptimal because many small frames are sent to
qdisc/driver layers instead of big ones (constrained by cwnd and packets
in flight, of course).
In fact, applications using sendmsg() (adding an additional memory copy)
instead of vmsplice()/splice()/sendfile() are a bit faster because of
this anomaly, especially if serving small files in environments with
large initial [c]wnd.
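Call tcp_push() only if MSG_MORE is not set in the flags parameter.
This bit is automatically provided by splice() internals for every page
except the last, or for all pages if the user specified the
SPLICE_F_MORE splice() flag. (In this combined backport, the diff
tests MSG_SENDPAGE_NOTLAST, an internal flag added by the follow-on
fix, rather than MSG_MORE itself.)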
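From user space, the SPLICE_F_MORE case is just a flag on splice(2). Here is a minimal sketch, assuming the data has already been written into a pipe and `sock` is a connected stream socket. In the case this commit targets, that socket is TCP. The function name and the `more_follows` parameter are hypothetical:

```c
#define _GNU_SOURCE /* splice() and SPLICE_F_MORE are Linux-specific */
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Move `len` bytes that are already queued in a pipe into a connected
 * stream socket. A nonzero `more_follows` (hypothetical parameter) sets
 * SPLICE_F_MORE, hinting to the kernel that more data will follow.
 * Returns the number of bytes moved, or -1 on error.
 */
static ssize_t splice_pipe_to_sock(int pipe_rd, int sock, size_t len,
				   int more_follows)
{
	unsigned int flags = more_follows ? SPLICE_F_MORE : 0;
	size_t done = 0;

	while (done < len) {
		ssize_t n = splice(pipe_rd, NULL, sock, NULL, len - done, flags);

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0) /* pipe write end closed before `len` bytes arrived */
			break;
		done += (size_t)n;
	}
	return (ssize_t)done;
}
```

A sender might pass a nonzero `more_follows` while further chunks remain, and 0 for the final chunk.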
In some workloads, this can reduce the number of sent logical packets
by an order of magnitude, making zero-copy TCP actually faster than
one-copy :)
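The order-of-magnitude claim can be illustrated with a toy model, not kernel code. The model follows the description above in two ways. First, every pushed page becomes one MSS-multiple TSO frame plus one sub-MSS frame. Second, without per-page pushes, pages coalesce into skbs of a fixed, hypothetical `size_goal` (assumed to be a multiple of the MSS). cwnd, Nagle and the receive window are ignored. All function names are hypothetical:

```c
#include <stddef.h>

/* Toy model only: counts frames (skbs) handed to the qdisc/driver. */

/* One pushed skb is cut into an MSS-multiple TSO frame plus a sub-MSS tail. */
static size_t frames_for_push(size_t len, size_t mss)
{
	return (len / mss ? 1 : 0) + (len % mss ? 1 : 0);
}

/* Before the patch: tcp_push() after every page. */
static size_t frames_push_per_page(size_t file_len, size_t page, size_t mss)
{
	size_t frames = 0;

	for (size_t off = 0; off < file_len; off += page) {
		size_t chunk = file_len - off < page ? file_len - off : page;

		frames += frames_for_push(chunk, mss);
	}
	return frames;
}

/* After the patch: pages accumulate into skbs of up to size_goal bytes
 * (assumed to be a multiple of mss), so only the final remainder is cut. */
static size_t frames_coalesced(size_t file_len, size_t size_goal, size_t mss)
{
	return file_len / size_goal + frames_for_push(file_len % size_goal, mss);
}
```

The example uses mss = 1460 and an assumed size_goal of 64240 bytes (44 × 1460). With those values, a 65536-byte file (16 pages) gives 32 frames with a push per page versus 2 when coalesced. A 1 MiB file gives 512 versus 18.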
Reported-by: Tom Herbert <therbert@google.com>
Cc: Nandita Dukkipati <nanditad@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Tom Herbert <therbert@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: H.K. Jerry Chu <hkchu@google.com>
Cc: Maciej Żenczykowski <maze@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Cc: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Willy Tarreau <w@1wt.eu>
Diffstat (limited to 'net')
| mode | file | lines changed |
|---|---|---|
| -rw-r--r-- | net/ipv4/tcp.c | 2 |
| -rw-r--r-- | net/socket.c | 6 |

2 files changed, 4 insertions, 4 deletions
```diff
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b9644d826653..6232462ffcb8 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -847,7 +847,7 @@ wait_for_memory:
 	}
 
 out:
-	if (copied)
+	if (copied && !(flags & MSG_SENDPAGE_NOTLAST))
 		tcp_push(sk, flags, mss_now, tp->nonagle);
 	return copied;
 
diff --git a/net/socket.c b/net/socket.c
index d449812d6208..bf9fc68a554c 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -732,9 +732,9 @@ static ssize_t sock_sendpage(struct file *file, struct page *page,
 
 	sock = file->private_data;
 
-	flags = !(file->f_flags & O_NONBLOCK) ? 0 : MSG_DONTWAIT;
-	if (more)
-		flags |= MSG_MORE;
+	flags = (file->f_flags & O_NONBLOCK) ? MSG_DONTWAIT : 0;
+	/* more is a combination of MSG_MORE and MSG_SENDPAGE_NOTLAST */
+	flags |= more;
 
 	return kernel_sendpage(sock, page, offset, size, flags);
 }
```
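The patched decision logic can be modelled in user space. This is a minimal sketch, not kernel code. `DEMO_SENDPAGE_NOTLAST` is a hypothetical stand-in for the kernel-internal MSG_SENDPAGE_NOTLAST, which user-space headers do not export, and its value here is an assumption. The two function names are also hypothetical:

```c
#include <fcntl.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical stand-in for the kernel-internal MSG_SENDPAGE_NOTLAST;
 * the value is an assumption chosen not to collide with MSG_MORE or
 * MSG_DONTWAIT. */
#define DEMO_SENDPAGE_NOTLAST 0x20000

/* Models the patched sock_sendpage(): `more` already carries MSG_MORE
 * and/or the NOTLAST bit, so it is OR-ed into the flags as-is. */
static int demo_sendpage_flags(int file_flags, int more)
{
	int flags = (file_flags & O_NONBLOCK) ? MSG_DONTWAIT : 0;

	return flags | more;
}

/* Models the patched tail of do_tcp_sendpages(): push only when some data
 * was queued and the caller did not say more pages follow. */
static bool demo_should_push(size_t copied, int flags)
{
	return copied && !(flags & DEMO_SENDPAGE_NOTLAST);
}
```

In this model, a call carrying MSG_MORE but not the NOTLAST bit still pushes. By contrast, a check on MSG_MORE alone, as in the commit text, would skip the push whenever SPLICE_F_MORE was given.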