2012-07-04

Modern C++ API in Muduo, part 1

When writing a C++ library, we have to design the interface (API) carefully [Effective C++ 3rd ed. Item 18: Make interfaces easy to use correctly and hard to use incorrectly]. Muduo uses a modern C++ API style, which consists of two aspects: callbacks and object lifetime management. This post talks about callbacks.


In traditional object-oriented C++ programs, callbacks are usually done with virtual functions. A network library would define one or more base classes to be inherited by user code, e.g. an EventHandler or Handler class that declares some (pure) virtual functions for event callbacks: onConnect(), onDisconnect(), onMessage(), onTimer(), and so on. To get event notifications, user code has to inherit the base Handler and override the virtual functions it cares about in a derived MyHandler class. Dynamic binding in C++ works only through pointers or references, so the library must hold pointers or references to the base Handler class, which actually refer to derived MyHandler objects. This is legitimate in so-called pure OO languages like Java. [Java 8 will introduce closures; C# has had delegates since day one.]
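A minimal sketch of this traditional style (the class and function names here are made up for illustration, not taken from any particular library):

    // Hypothetical traditional interface -- names are illustrative only.
    #include <cstddef>

    class Handler
    {
     public:
      virtual ~Handler() {}    // virtual destructor, since the library may delete us
      virtual void onConnect() {}
      virtual void onDisconnect() {}
      virtual void onMessage(const char* data, std::size_t len) {}
    };

    class MyHandler : public Handler   // user code must inherit the library's base class
    {
     public:
      virtual void onMessage(const char* data, std::size_t len)
      {
        // business logic goes here
      }
    };

    // The library holds a Handler* that actually points to a MyHandler object.
    // Who owns that object, and who is allowed to delete it?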


However, in a non-GC language like C++, this callback-via-virtual-function API style has a fundamental difficulty: how to manage the lifetime of MyHandler objects.
  • Is MyHandler one-per-process or one-per-socket? Does the handler object own the socket file descriptor? If so, Handler should be non-copyable, otherwise there is a risk of double-freeing the socket fd.
  • Is the MyHandler object created by the library? If so, does user code need to provide a derived HandlerFactory class and register it with the library?
  • Either way, the ownership of MyHandler is vague: does the library own it? Can the library safely delete the Handler pointer it holds? (Presumably the base Handler class has a virtual destructor, of course.)
  • How does the library ensure that deleting the object will not leave dangling pointers elsewhere in the program?
There are many traps and pitfalls like these when a C++ library interface is designed this way; see section 5 of "The Design and Use of the ACE Reactor": http://www.cs.wustl.edu/~schmidt/PDF/reactor-rules.pdf


Anyway, I’d consider ‘delete this’ a bad smell in C++ code.


In modern C++ [I mean since TR1 was finalized in 2005, not C++11], there is an alternative way to do callbacks: std::function, a.k.a. std::tr1::function or boost::function. Here we will use boost::function, as Muduo is written in C++03, not C++11.


The Muduo library exposes all its callbacks as boost::function. It expects user code to pass in a few boost::function objects for various events, such as the connection event, the data-arrival event, and the timer event. The library keeps a copy of each boost::function passed in. When an event happens, Muduo invokes the boost::function, which in turn calls user code. Unlike MyHandler, boost::function is a value object, just like std::string and std::vector; the library can hold a copy of it with no need to worry about its lifetime. A boost::function never invalidates*.
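For example, registration might look roughly like this (a sketch only; the class name and callback signatures are simplified stand-ins, not the exact Muduo API):

    // Sketch of callback registration via boost::function; signatures simplified.
    #include <cstddef>
    #include <boost/function.hpp>

    typedef boost::function<void()> ConnectionCallback;
    typedef boost::function<void(const char* data, std::size_t len)> MessageCallback;

    class Server   // stand-in for a library class such as a TCP server
    {
     public:
      void setConnectionCallback(const ConnectionCallback& cb) { connectionCb_ = cb; }
      void setMessageCallback(const MessageCallback& cb) { messageCb_ = cb; }

     private:
      ConnectionCallback connectionCb_;  // the library keeps its own copy
      MessageCallback messageCb_;        // invoked when the corresponding event happens
    };

    void onMessage(const char* data, std::size_t len)  // a plain function works fine
    {
      // handle the message
    }

    // server.setMessageCallback(onMessage);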


Advantages:

  • User code doesn’t need to inherit a base Handler class defined in the library; there is no such base class in Muduo. This avoids the vague ownership of MyHandler objects.
  • User code doesn’t even need classes at all. For simple network programs, plain functions work fine as callbacks and event handlers (handler in the general sense).
  • User code may use member functions as callbacks, using boost::bind to convert a member function and its owning object into a boost::function. The member function that receives the callback doesn’t have to have any particular name, although onConnection and onMessage are common choices. It doesn’t need to match the name of a base-class virtual function; there is no base class anyway. Moreover, its signature may differ slightly from the boost::function declaration, taking more or fewer parameters.
  • Such a bound boost::function must remain safe for the library to call, so be careful about the lifetime of the object the member function belongs to; try not to boost::bind a member function to a plain object pointer (this pointer). Either boost::bind the member function to a shared_ptr of that class (or a weak_ptr plus a forwarding function), or unregister the callback before the object is destroyed, or bind the member function to a long-lived object that survives until the process terminates. (A sketch of the shared_ptr approach follows this list.)
  • Flexible. A class may let its member function A receive data from connection A and its member function B receive data from connection B. In traditional OO design, since a class can inherit from the same base class only once, you would have to play some tricks to make this happen. See muduo/examples/socks4a/Tunnel.h for an example.
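Here is a minimal sketch of the shared_ptr binding mentioned above (Printer and its print() member are made-up names):

    // Sketch: binding a member function to a shared_ptr keeps the object alive
    // for as long as the library holds the resulting boost::function.
    #include <cstddef>
    #include <boost/bind.hpp>
    #include <boost/function.hpp>
    #include <boost/shared_ptr.hpp>

    class Printer
    {
     public:
      void print(const char* data, std::size_t len) { /* business logic */ }
    };

    typedef boost::function<void(const char*, std::size_t)> MessageCallback;

    boost::shared_ptr<Printer> printer(new Printer);
    // The bound function holds a copy of the shared_ptr, so the Printer object
    // cannot be destroyed while this callback is still registered somewhere.
    MessageCallback cb = boost::bind(&Printer::print, printer, _1, _2);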

In general, library interfaces built on boost::function place far fewer restrictions on user code.


Worried about performance?



Calling a boost::function could be a bit slower than calling a virtual function (at the very least, calling a pointer-to-member-function can't be faster than calling a virtual function). Does that matter? I don't think so, because IO events involve system calls. Every time the library has something to notify user code about, it must already have made some syscall(s), e.g. accept(2) for the connection callback, read(2) for the data and connection callbacks, etc. A syscall traps into the kernel and causes a context switch, which is far more costly than either a C++ virtual function call or a boost::function call. So the performance overhead of boost::function is negligible.


By using boost::function as the event-callback mechanism of a network library, the interface is much cleaner than the traditional OO design, IMO, and harder to use incorrectly.

This post is long enough in itself; let me talk about object lifetime management in multithreaded programs next time. Thread-safe destruction is an interesting problem.


2011-04-08

Essentials of non-blocking TCP network programming on Linux

Author: Shuo CHEN (chenshuo#chenshuo.com)

http://code.google.com/p/muduo/

If you can read Chinese, you may read my original post instead: http://blog.csdn.net/Solstice/archive/2011/02/02/6171831.aspx (some of the material here is new, though).

Section 6.2 of Unix Network Programming vol. 1, 3rd ed. (UNPv1) presents a good summary of the five IO models on Unix/Linux, namely blocking, non-blocking, IO multiplexing, signal-driven, and asynchronous, although the authors don't talk about threads.

The C10k problem page describes five IO strategies, which do take threads into account.

In this multi-core era, threads are inevitable. One would ask: how do we build efficient and scalable multithreaded server-side network programs? Which thread model(s) should we use? Is there any rule of thumb to make the decision easier?

The author of libev suggests:

one loop per thread is usually a good model.

Doing this is almost never wrong; sometimes a better-performing model exists, but it is always a good start.

Given that, the question becomes how to build an efficient and easy-to-use event loop; then we can assemble a multithreaded program with the one-loop-per-thread model (perhaps plus a thread pool for pure computing tasks).

My personal experience led to the Muduo network library, a modern C++ implementation of the reactor pattern for non-blocking network programming. Before I introduce the Muduo library, I'd like to show you the challenges of writing non-blocking network programs.

In real life, non-blocking is commonly used in conjunction with IO-multiplexing, for two reasons:

  • No one would really poll for an IO event with a busy loop; it wastes CPU cycles.
  • No one should use blocking IO together with IO-multiplexing, because an accept() or read() could block indefinitely; see section 16.6 "Nonblocking accept" of UNPv1 for an example.

So when I say non-blocking, I actually mean non-blocking + IO-multiplexing. Neither of them is useful on its own.

IMO, the accidental complexity of a network program is dealing with the Sockets API, resource management, concurrency, and so on, while the essential complexity is reacting to IO events, i.e. the business logic: what you do when you get a request from a client. That is the purpose of your program, and you should spend most of your valuable time working on it, not playing around with the Sockets API.

The three and a half essential events in TCP programming

IMO, TCP networking deals with three and a half events. (Connection and socket are interchangeable in this post.)

  1. Connection established.
  2. TCP socket disconnected.
  3. Message received.
  4. Message send completed. This counts as the half.

A good network library should wrap these events and provide an interface allowing client code to register callbacks for them. See my next post for an example of how to do it with Muduo.

Connection established

For a TCP server, this event means a connection from a client has been accept()ed. (A good library should accept() for you and pass you the established connection, not just tell you that the listening file descriptor is readable.) For a TCP client, this event means connect() has succeeded and the socket has become writable.

After the connection is established, there is no difference between server-side and client-side programming: both sides talk through the TCP socket, either of them can send data to the other end, and either of them can actively close the connection.

In Muduo, if you write a TCP server you use the TcpServer class; if you write a TCP client you use the TcpClient class. But they only differ in how the connection is established; once a connection is made, you use TcpConnection to exchange data, no matter which side your program is on.

Of course a program can play both roles without using threads: just make TcpServer and TcpClient share one EventLoop.

TCP socket disconnected

Both server and client can actively close an established connection. Normally this is done by calling close() or shutdown() to send a FIN to the peer; the other end will read() a 0 and then close() the connection.

Quick quiz: what if the passive-close end never calls close() (maybe it's deadlocked)? Which TCP state will it be in? Which TCP state will the active-close end be in? What tool can you use to diagnose this issue on a live system? Hint: FIN_WAIT_2, CLOSE_WAIT, half-close. For details on TCP states, see chapter 18 of the great book TCP/IP Illustrated vol. 1.

Usually there is a pre-agreement on which end should initiate the disconnect, but both sides should be prepared for an unwanted disconnection, because the other side could crash.

The end that initiates the disconnect will stay in the TIME_WAIT state for a while, holding some kernel resources for that period. Generally we'd like to design the protocol so that the client disconnects actively, to save resources on the server, which may be serving a lot of clients simultaneously.

Message received

This is the meat of network programming, the most important event to deal with. It generally means a socket has become readable. We say message received, not data received, because TCP is a byte-stream protocol; the application layer normally sends messages in frames, e.g. each message carries a fixed-length header giving the size of the message body, so that the receiver can decode the frames and split the byte stream into individual messages.

(Packet, fragment, segment, and frame are names used at different layers, although frame is also sometimes used for the Ethernet layer.)

A good library should read() the data for the client code and call either the onMessage() or the onDisconnect() callback depending on the return value of read(), and possibly deal with buffering/framing/decoding as well.
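A rough sketch of that dispatch logic (a hypothetical function; a real library also handles partial reads, EAGAIN/EINTR, and buffering more carefully):

    // Sketch of read-and-dispatch; error handling is deliberately simplified.
    #include <cerrno>
    #include <unistd.h>

    void handleRead(int sockfd)
    {
      char buf[65536];
      ssize_t n = ::read(sockfd, buf, sizeof buf);
      if (n > 0)
      {
        // append buf[0..n) to the connection's input buffer, then call
        // onMessage() if at least one complete message is now available
      }
      else if (n == 0)
      {
        // the peer sent FIN: invoke the close/disconnect callback
      }
      else if (errno != EAGAIN && errno != EINTR)
      {
        // a real error: report it and close the connection
      }
    }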

Message sent complete

This counts as the half. It means the data has been handed to the TCP/IP stack; it does not necessarily mean the other end has received it. Not all programs care about this event.

If you build a high-throughput server with non-blocking IO, you should watch this event so that you don't send too fast and consume too much memory buffering data in the process's address space.

Why application-level buffers are indispensable in non-blocking network programming

The whole idea of non-blocking programming is to never block on a read() or write() or their equivalents; we want to reuse one thread of control as much as possible, so that we can serve many clients with one thread. An IO thread may only block on select()/poll()/epoll_wait(). Therefore, application-level buffers are needed in the TcpConnection class.

TcpConnection must have an output buffer. If the client code wants to write 40k bytes and the OS only accepts 25k at the moment, you don't want to wait until the OS buffer becomes available again (i.e. until the peer slides its TCP window forward); that would block your program. So the library must take over the remaining 15k of data, store it in its output buffer, and send it as soon as the socket becomes writable. If the client code then wants to write another 50k bytes while there are still 5k in the output buffer, the library should append the 50k to the connection's output buffer instead of trying to write it immediately.
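A sketch of how such a send path behaves (illustrative names only, not the real TcpConnection interface; here the output buffer is just a std::string):

    // Sketch: write what the OS accepts now, queue the rest in the output buffer.
    #include <cstddef>
    #include <string>
    #include <unistd.h>

    void send(int sockfd, std::string& outputBuffer, const char* data, std::size_t len)
    {
      ssize_t n = 0;
      if (outputBuffer.empty())            // only hit the socket if nothing is queued,
        n = ::write(sockfd, data, len);    // otherwise the data would go out of order
      if (n < 0)
        n = 0;                             // e.g. EAGAIN: treat as "nothing written"
      if (static_cast<std::size_t>(n) < len)
      {
        outputBuffer.append(data + n, len - n);   // queue the remainder
        // ...and start watching the writable (EPOLLOUT) event on sockfd
      }
    }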

TcpConnection must have an input buffer. By the nature of TCP, two 10k messages (20k in total) can arrive in any fashion: 20k of data at once, two reads of 5k+15k, two reads of 15k+5k, two reads of 10k+10k, three reads of 6k+6k+8k, etc. The library should buffer the received data somewhere if a message is incomplete, call back the client code as soon as it has a complete message, and of course make the buffer reusable for the next message.
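With a fixed-length header as described above, the decoding loop over the input buffer looks roughly like this (a sketch using std::string as the buffer; a real library provides a proper Buffer class):

    // Sketch: split a byte stream into length-prefixed messages.
    #include <cstring>
    #include <string>
    #include <stdint.h>
    #include <arpa/inet.h>   // ntohl

    void onDataArrived(std::string& inputBuffer)
    {
      while (inputBuffer.size() >= sizeof(uint32_t))
      {
        uint32_t beLen;
        std::memcpy(&beLen, inputBuffer.data(), sizeof beLen);
        uint32_t bodyLen = ntohl(beLen);                  // 4-byte header = body length
        if (inputBuffer.size() < sizeof beLen + bodyLen)
          break;                                          // incomplete message, wait for more
        std::string message = inputBuffer.substr(sizeof beLen, bodyLen);
        inputBuffer.erase(0, sizeof beLen + bodyLen);
        // onStringMessage(message);                      // hand a complete message to user code
      }
    }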

Lighttpd once had a problem with exactly this kind of TCP splitting: data arriving halfway through a "\r\n\r\n":

if the \r and the \n is put into different TCP packets, Lighttpd is then unable to recognize it as successful request, …

A good acceptance test is that the library works correctly when data arrives byte by byte, 10ms apart. There was a security hole in Lighttpd that allowed a DoS if a client sent its request slowly and piecewise.

Challenges of non-blocking network programming

Network developers face many challenges in non-blocking IO; here's the short list I kept in mind when designing and implementing Muduo.

  • If you want to actively close a TCP connection, how do you ensure the other end has received all the data? As we said before, a TcpConnection has an output buffer; there might still be outgoing data in it, so you probably should not call close(2) or shutdown(2) directly.
  • If you initiate a connection but the server actively rejects it, how do you retry with back-off? (E.g. the server program is not running at the moment and the OS sends an RST on each connection attempt, but you want your program to start working automatically once the server program comes up.)
  • Edge trigger or level trigger? If level trigger, when do you watch the EPOLLOUT event? Is epoll(7) always faster than poll(2)? (Muduo uses level trigger, and you can choose between epoll(7) and poll(2) as the polling backend.)
  • How do you read() efficiently? On one hand, we'd like to save on syscalls, so we should read() as much as possible whenever a socket becomes readable; this seems to imply preparing a large input buffer (say 64k, to match the TCP window size) for each TcpConnection. On the other hand, we'd like to reduce memory usage: if a server has 10,000 connections and allocates a 64k buffer for each, it consumes 640M of memory, and most of the time only the first few kilobytes of each buffer are used while the rest is wasted. You might think of using ioctl()/FIONREAD to get the amount of pending data before read()ing, but that doubles the number of syscalls. Muduo reads data efficiently in a clever way, using scatter/gather IO with memory on the stack (see the sketch after this list).
  • Buffers should grow dynamically rather than be fixed-size arrays, because message sizes vary and you can't use the maximum possible size as the default buffer size; that would consume too much memory.
  • How do you deal with a slow receiver? If you keep sending data blindly, will your output buffer explode? That's why we need the "message sent complete" callback.
  • If a thread is blocked in epoll_wait(), how do you wake it up to add a new channel to its watch list? (eventfd(2) is more efficient than the usual pipe(2) or socketpair(2) on Linux.)
  • How do you deal with timers? How do you schedule periodic tasks? The old-fashioned way is to use a timer queue and let it calculate how long to epoll_wait(). But the resolution is only milliseconds; can we do better? (timerfd on Linux is made for this: it's an ordinary file descriptor, so it can be added to epoll's watch list and handled in the same thread as IO events.)
  • I haven't mentioned signal handling yet. Linux provides a signalfd(2) syscall, so we can handle signals uniformly like other IO events, without worrying about reentrancy, asynchronous notification, etc.
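The scatter/gather trick mentioned in the list above looks roughly like this (a simplified sketch; the real code lives in Muduo's Buffer class, and the names here are illustrative):

    // Each connection keeps only a small input buffer; a 64k buffer on the stack
    // absorbs occasional large bursts within the same single readv() syscall,
    // and the spill-over is appended so the buffer grows only when it has to.
    #include <cstddef>
    #include <vector>
    #include <sys/types.h>
    #include <sys/uio.h>     // readv

    ssize_t readFd(int fd, std::vector<char>& buffer, std::size_t& used)
    {
      if (buffer.size() == used)
        buffer.resize(used + 4096);               // keep a little writable room
      char extrabuf[65536];                       // on the stack, free when unused
      struct iovec vec[2];
      const std::size_t writable = buffer.size() - used;
      vec[0].iov_base = &buffer[used];
      vec[0].iov_len = writable;
      vec[1].iov_base = extrabuf;
      vec[1].iov_len = sizeof extrabuf;
      ssize_t n = ::readv(fd, vec, 2);
      if (n > 0)
      {
        if (static_cast<std::size_t>(n) <= writable)
        {
          used += n;                              // it all fit into the existing room
        }
        else
        {
          std::size_t spill = n - writable;       // the existing room is full...
          buffer.insert(buffer.end(), extrabuf,   // ...append the spill-over, growing
                        extrabuf + spill);        //    the buffer only when necessary
          used = buffer.size();
        }
      }
      return n;
    }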

By now I guess you get the idea that non-blocking IO is much more difficult than blocking IO. The cost is worth paying if you are building a scalable server-side program, and a good network library should save you a lot of effort and simplify things considerably.

(to be continued.)