2011-04-08

Essentials of non-blocking TCP network programming on Linux

Author: Shuo CHEN (chenshuo#chenshuo.com)

http://code.google.com/p/muduo/

If you can read Chinese, you may read my original post at http://blog.csdn.net/Solstice/archive/2011/02/02/6171831.aspx instead, though some of the material here is new.

Section 6.2 of Unix Network Programming vol. 1, 3rd ed. (UNPv1) presents a good summary of the five IO models on Unix/Linux, namely blocking, non-blocking, IO multiplexing, signal-driven, and asynchronous, although the authors don't talk about threads.

The C10k problem page describes five IO strategies, which do take threads into account.

In this multi-core era, threads are inevitable. How does one build efficient and scalable multithreaded network server-side programs? Which threading model(s) should we use? Is there a rule of thumb that makes the decision easier?

The author of libev suggests that

    one loop per thread is usually a good model.

Doing this is almost never wrong; sometimes a better-performing model exists, but it is always a good start.

Given that, the question becomes how to build an efficient and easy-to-use event loop; then we can build multithreaded programs with the one-loop-per-thread model (plus perhaps a thread pool for pure computing tasks).

My personal experiences led to the Muduo network library, a modern C++ implementation of the reactor pattern for non-blocking network programming. Before introducing the Muduo library, I'd like to show you the challenges of writing non-blocking network programs.

In real life, non-blocking is commonly used in conjunction with IO-multiplexing, for two reasons:

  • no one would really poll an IO event with a busy loop; it wastes CPU cycles.
  • no one should use blocking IO with IO-multiplexing, because an accept() or read() could block indefinitely; see section 16.6 "Nonblocking accept" of UNPv1 for an example.

So when I say non-blocking, I actually mean non-blocking + IO-multiplexing. You can't use either of them standalone.

IMO, the accidental complexity of network programming lies in dealing with the Sockets API, resource management, concurrency, etc., while the essential complexity is reacting to IO events, i.e. the business logic: what you do when you get a request from a client. This is the purpose of your program; you should spend most of your valuable time working on it, not playing around with the Sockets API.

The three and a half essential events in TCP programming

IMO, TCP networking is about dealing with three and a half events. (Connection and socket are interchangeable in this post.)

  1. Connection established.
  2. TCP socket disconnected.
  3. Message received.
  4. Message sent completely. (This counts as 0.5.)

A good network library should wrap these events and provide an interface allowing client code to register callbacks for them. See my next post for an example of how to do it with Muduo.

Connection established

For a TCP server, this event means a connection from a client has been accept()ed. (A good library should call accept() for you and pass you the established connection, not just tell you that the listening file descriptor is readable.) For a TCP client, this event means connect() has succeeded and the socket has become writable.

After a connection is established, there is no difference between server-side and client-side programming: both sides talk through the TCP socket, either of them can send data to the other end, and either of them can actively close the connection.

In Muduo, if you write a TCP server you will use the TcpServer class; if you write a TCP client you will use the TcpClient class. But these only differ in how the connection is established; once a connection is made, you will be using TcpConnection for data exchange, no matter which side your program is on.

Of course a program can play both roles without using threads; just make TcpServer and TcpClient share one EventLoop.

TCP socket disconnected

Both the server and the client can actively close an established connection. Normally this is done by calling close() or shutdown() to send a FIN to the peer; the other end will read() a 0, then close() the connection.

Quick quiz: what if the passive-close end doesn't call close() (maybe it's deadlocked)? Which TCP state will it be in? Which TCP state will the active-close end be in? What tool can you use to diagnose this issue on a live system? Hint: FIN_WAIT_2, CLOSE_WAIT, half-close. For details on TCP states, see chapter 18 of the great book TCP/IP Illustrated vol. 1.

Usually there will be a pre-agreement on which end should initiate the disconnection, but both sides should be prepared for an unwanted disconnection, because the other side could crash.

The end that initiates the disconnection will stay in the TIME_WAIT state for some time, holding some kernel resources for that period. Generally we'd like to design the protocol so that the client disconnects actively, to save resources on the server, which may be serving many clients simultaneously.

Message received

This is the meat of network programming, the most important event to deal with. It generally means a socket has become readable. We say message received rather than data received, because TCP is a byte-stream protocol; the application layer normally sends messages in frames, e.g. each message carries a fixed-length header giving the size of the message body, so that the receiver can decode the frames and split the byte stream into individual messages.

(Packet, fragment, segment, and frame are the names of the data units at different layers, although frame is also sometimes used for the Ethernet layer.)

A good library should read() the data for the client code, and call either the onMessage() or the onDisconnect() callback depending on the return value of read(), possibly also handling buffering/framing/decoding.

Message sent complete

This is the event that counts as 0.5. It means the data has been written to the TCP/IP stack, not necessarily that the other end has received it. Not all programs care about this event.

If you build a high-throughput server with non-blocking IO, you'd watch this event so as not to send too fast and consume too much memory buffering data in the process's address space.

Why buffered IO is indispensable in non-blocking network programming

The whole idea of non-blocking programming is to never block on read() or write() or their equivalents; we want to reuse one thread of control as much as possible, so that we can serve many clients with one thread. An IO thread may only block on select()/poll()/epoll_wait(). Therefore, application-level buffers are needed in the TcpConnection class.

TcpConnection must have an output buffer. If the client code wants to write 40k bytes but the OS only accepts 25k at the moment, you don't want to wait until the OS buffer has room again (waiting for the peer to slide its TCP window forward), as this would block your program. So the library must take over the remaining 15k of data, store it in its output buffer, and send it as soon as the socket becomes writable. If the client code then wants to write another 50k bytes while there are still 5k in the output buffer, the library should append the 50k to the connection's output buffer instead of trying to write it immediately.

TcpConnection must have an input buffer. By the nature of TCP, two 10k messages (20k in total) can arrive in any fashion: 20k at once, two reads of 5k+15k, two reads of 15k+5k, two reads of 10k+10k, three reads of 6k+6k+8k, etc. The library should buffer the received data somewhere while a message is incomplete, call back the client code as soon as it has a complete message, and of course make the buffer reusable for the next incoming message.

Lighttpd had a problem with TCP segmentation: data arriving halfway through a "\r\n\r\n":

if the \r and the \n are put into different TCP packets, Lighttpd is then unable to recognize it as a successful request, …

A good acceptance test is that the library works correctly when data arrives byte by byte, 10ms apart. There was a security hole in Lighttpd that caused a DoS when a client sent its request slowly, piece by piece.

Challenges of non-blocking network programming

Network developers face many challenges in non-blocking IO; here's my short list from designing and implementing Muduo.

  • If you want to actively close a TCP connection, how do you ensure the other end has received all the data? As we said before, a TcpConnection has an output buffer, and there might be outgoing data in it, so you probably should not call close(2) or shutdown(2) directly.
  • If you initiate a connection but the server actively rejects it, how do you retry with back-off? (E.g. the server program is not running at the moment, so the OS sends a RST for each connection attempt, but you want your program to start working automatically once the server program comes up.)
  • Edge trigger or level trigger? If level trigger, when should you watch for the EPOLLOUT event? Is epoll(4) always faster than poll(2)? (Muduo uses level trigger, and you can choose between the epoll(4) and poll(2) polling backends.)
  • How to read() efficiently? On one hand, we'd like to save syscalls, so we should read() as much as possible when a socket becomes readable; this seems to imply preparing a large (say 64k, matching the TCP window size) input buffer for each TcpConnection. On the other hand, we'd like to reduce memory usage: if a server has 10,000 connections and allocates a 64k buffer for each, it consumes 640M of memory, while most of the time only the first few kilobytes of each buffer are used and the rest is wasted. You might think of using ioctl()/FIONREAD to learn the pending data size before read()ing, but that doubles the number of syscalls. Muduo uses a clever way to read data efficiently: scatter/gather IO with extra memory on the stack.
  • The buffer should be dynamically sized, not a fixed-size array, because message sizes vary and using the maximum possible size as the default would consume too much memory.
  • How to deal with a slow receiver? If you keep sending data blindly, will your output buffer explode? That's why we need the "message sent complete" callback.
  • If a thread is blocked in epoll_wait(), how do you wake it up to add a new channel to its watch list? (eventfd(2) is more efficient than the usual pipe(2) or socketpair(2) on Linux.)
  • How to deal with timers? How to schedule periodic tasks? The old-fashioned way is to use a timer queue and have it calculate the timeout to pass to epoll_wait(), but that timeout's resolution is only milliseconds; can we do better? (timerfd(2) on Linux is made for this; it is an ordinary file descriptor, so it can be added to epoll's watch list and share the same thread as IO events.)
  • I haven't mentioned signal handling yet. Linux provides the signalfd(2) syscall, so signals can be handled uniformly like other IO events, with no worries about reentrancy, asynchronous notification, etc.

Now I guess you get the idea that non-blocking IO is much more difficult than blocking IO. It is worth paying the cost if you are building a scalable server-side program, and a good network library should save you a lot of work and simplify things considerably.

(to be continued.)
