.. _libfabric:

Libfabric backend
=================

This chapter explains some of the details of the libfabric backend implementation.
Most of them are dictated by the libfabric library. The `libfabric documentation `_
provides a good explanation of the library and its concepts.

Sending data
------------

Libfabric aims to provide maximum performance. One way to achieve high performance is
to avoid data copies. Instead of copying data into kernel space, we need to register
memory regions from which we can send data. Sending data that lies outside a
registered memory region is impossible. We also need to guarantee that, once a send
operation has been initiated, the data is not modified until the operation has
completed.

For buffered sending, we allocate the given number of buffers and register them with
libfabric. Matching send operations with completions is done internally. The only
thing a user has to be careful about is not modifying the buffer after
``send_buffer`` is called. In fact, calling ``send_buffer`` should be treated as
giving back the buffer that was received from ``get_buffer``. If you need a new
buffer, you need to call ``get_buffer`` again. As a user, you should never have to
care about the key returned by the send completion callback.

For zero-copy sending, we register one memory region that is provided via the
connection parameters and another region that contains the additional header data
(16 bytes per send operation). The header data is copied into the registered memory
region and can therefore be modified or deleted after the send operation is
initiated. For the main data, the user has to guarantee that the data is not modified
until the send operation has completed, and the user has to match the send completion
with the send operation.

The need to register memory regions is also the reason why immediate sending is not
supported by the libfabric backend. Immediate sending would require copying the data
into a registered memory region.
If the data does not already reside inside a registered memory region (as it does for
zero-copy sending), it needs to be copied into an allocated memory region, which is
precisely what buffered sending does.

Receiving data
--------------

As for sending data, we also need to register memory regions for receiving data. This
is done by registering the given number of buffers with libfabric. When data is
received, libfabric places it into an available buffer and the data is provided to
the user via the receive completion callback. Once the callback has returned, the
buffer is handed back to the libfabric backend and can be used for the next receive
operation.

Connection management
---------------------

Opening and closing connections are expensive operations. Re-use connections as much
as possible and avoid opening and closing them frequently, especially under high
load.

Thread safety
-------------

The libfabric backend can be configured to be thread-safe at the cost of some
performance. If it is desirable to send data from threads other than the event loop
thread, the performance penalty might still be lower than the extra cost of sharing
data between threads and moving execution onto the event loop thread. If the
libfabric backend is used in thread-unsafe mode, no functions may be called from
threads other than the event loop thread.

Recommendations
---------------

- Use the libfabric backend only if you have appropriate hardware.
- Use the libfabric backend if you need a high-performance and low-latency network
  backend.
- Avoid opening and closing connections frequently.
- Every receiver allocates receive buffers, and every buffered sender allocates send
  buffers. Limit the number of connections to avoid running out of memory.
- Zero-copy sending is more performant but trickier to use. Make sure you understand
  the implications of zero-copy sending.
- Buffered sending can be faster than zero-copy sending for small messages, as
  multiple small messages can be sent in one operation.
- Prefer the ``EpollEventLoop`` backend when using the libfabric backend.
- Make sure callbacks never block the event loop. Blocking degrades performance and
  might deadlock the application.
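
The hand-off rule for buffered sending (treat ``send_buffer`` as giving back what
``get_buffer`` returned) can be illustrated with a small toy model. This is only a
sketch under stated assumptions: ``ToyBufferPool``, its signatures, ``data``,
``complete_send`` and ``demo_handoff`` are hypothetical and are not the backend's
API; only the names ``get_buffer`` and ``send_buffer`` are taken from the text above.
No libfabric calls are made, and memory registration is not modelled.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <vector>

// Toy model of the buffered-sending hand-off. Hypothetical types and
// signatures; this is NOT the libfabric backend's real interface.
class ToyBufferPool {
public:
  enum class State { Free, OwnedByUser, InFlight };

  ToyBufferPool(std::size_t num_buffers, std::size_t buffer_size)
      : m_storage(num_buffers, std::vector<char>(buffer_size)),
        m_state(num_buffers, State::Free) {}

  // Hand a free buffer to the user, or nothing if every buffer is in use.
  std::optional<std::size_t> get_buffer() {
    for (std::size_t i = 0; i < m_state.size(); ++i) {
      if (m_state[i] == State::Free) {
        m_state[i] = State::OwnedByUser;
        return i;
      }
    }
    return std::nullopt;
  }

  // Writable access, allowed only while the user owns the buffer.
  std::vector<char>& data(std::size_t id) {
    if (m_state.at(id) != State::OwnedByUser) {
      throw std::logic_error("buffer is not owned by the user");
    }
    return m_storage[id];
  }

  // Give the buffer back for sending; the user must not touch it afterwards.
  void send_buffer(std::size_t id) {
    if (m_state.at(id) != State::OwnedByUser) {
      throw std::logic_error("buffer was not obtained via get_buffer");
    }
    m_state[id] = State::InFlight;
  }

  // Stand-in for the internal completion matching: recycles the buffer.
  void complete_send(std::size_t id) {
    if (m_state.at(id) != State::InFlight) {
      throw std::logic_error("no send in flight for this buffer");
    }
    m_state[id] = State::Free;
  }

  State state(std::size_t id) const { return m_state.at(id); }

private:
  std::vector<std::vector<char>> m_storage;
  std::vector<State> m_state;
};

// Walks a single buffer through get -> fill -> send -> completion.
void demo_handoff() {
  ToyBufferPool pool(1, 64);
  const auto id = pool.get_buffer();
  assert(id.has_value());
  pool.data(*id)[0] = 'x';   // fill while we own the buffer
  pool.send_buffer(*id);     // ownership goes back to the pool

  bool rejected = false;
  try {
    pool.data(*id);          // touching the buffer now is a usage error
  } catch (const std::logic_error&) {
    rejected = true;
  }
  assert(rejected);

  pool.complete_send(*id);   // the completion frees the buffer for re-use
  assert(pool.get_buffer().has_value());
}
```

In this model a buffer passed to ``send_buffer`` becomes inaccessible until the
completion recycles it, which mirrors the rule that a new buffer has to be requested
with ``get_buffer`` for every send. It also shows why a bounded pool can run dry
while sends are still in flight.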