.. _UnbufferedCommunication:

Unbuffered RDMA Communication
=============================

The main advantage of RDMA network communication is the support for zero-copy
network transfers, i.e. data transfers from the user space of one node to the
user space of another node without any memory copies. In other words, on the
sender side the network card reads the message from a host memory location,
while on the receiving side it writes the incoming message to a predefined
memory location. In the context of NetIO-next, zero-copy communication is
referred to as "unbuffered".

Unbuffered communication is convenient for workloads characterised by large
messages (>O(kB)). If instead the workload is characterised by a high rate of
small (O(byte)) messages, data coalescence is more convenient: messages are
copied into larger buffers which are transferred once full. Data coalescence is
supported by NetIO-next and is referred to as buffered communication, see
:ref:`BufferedCommunication`.

For both buffered and unbuffered communication, two messaging patterns are
supported: point-to-point and publish-subscribe. The former is intended for
communication between two remote endpoints and is typically used to send data
to the FELIX PC; the latter allows remote endpoints to receive streams of data
on demand, upon subscription.

Point-to-Point Communication
----------------------------

NetIO-next supports unidirectional unbuffered point-to-point communication
using the following socket types:

- *send sockets* (`struct netio_send_socket`): the sending side of a connection.
- *listen sockets* (`struct netio_listen_socket`): listen for incoming
  connections and create receive sockets to form connection pairs.
- *receive sockets* (`struct netio_recv_socket`): the receiving side of a
  connection, created by a listen socket.

Socket Initialization
.....................

Receive sockets are created by listen sockets when a connection is opened, so
users need to manually create and initialize only send and listen sockets. For
the initialization of these, the following functions are used:

.. doxygenfunction:: netio_init_send_socket
   :no-link:

.. doxygenfunction:: netio_init_listen_socket
   :no-link:

Unbuffered sockets support a set of user-definable callbacks for different
connection events. For send sockets, the following callbacks are supported::

    void (*cb_connection_established)(struct netio_send_socket*);
    void (*cb_connection_closed)(struct netio_send_socket*);
    void (*cb_internal_connection_closed)(struct netio_send_socket*);
    void (*cb_send_completed)(struct netio_send_socket*, uint64_t key);
    void (*cb_error_connection_refused)(struct netio_send_socket*);

`cb_connection_established` and `cb_connection_closed` update the application
about the connection status. `cb_internal_connection_closed` is a callback used
internally to clear resources in the appropriate way, depending on whether the
`send_socket` is associated with a `netio_unbuffered_publish_socket` or not.
`cb_send_completed` reports that a send operation completed successfully and
that the memory location that stored the message can be reused. In the
felix-tohost FELIX readout application this callback is used to report that
data written by the FELIX card has been sent and that the FELIX card is free to
overwrite it.
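For illustration, a minimal, hypothetical sketch of wiring these callbacks is
shown below. It assumes that the callbacks are plain function-pointer members
of the socket structure and that `netio_init_send_socket` takes the socket and
an already-initialized event loop; the authoritative argument lists are the
generated signatures above, not this sketch::

    #include <inttypes.h>
    #include <stdio.h>
    #include "netio.h"   /* NetIO-next API header; exact name/path may differ */

    static void on_established(struct netio_send_socket* s)
    {
        (void)s;
        printf("connection established\n");
    }

    static void on_send_completed(struct netio_send_socket* s, uint64_t key)
    {
        (void)s;
        /* The memory region identified by `key` can be reused from here on. */
        printf("send completed, key=%" PRIu64 "\n", key);
    }

    void setup_send_socket(struct netio_send_socket* socket,
                           struct netio_eventloop* evloop /* assumed argument */)
    {
        netio_init_send_socket(socket, evloop);   /* argument list is an assumption */
        socket->cb_connection_established = on_established;
        socket->cb_send_completed         = on_send_completed;
    }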
For listen sockets, the set of supported callbacks is::

    void (*cb_connection_established)(struct netio_recv_socket*);
    void (*cb_connection_closed)(struct netio_recv_socket*);
    void (*cb_msg_received)(struct netio_recv_socket*, struct netio_buffer*, void*, size_t);
    void (*cb_msg_imm_received)(struct netio_recv_socket*, struct netio_buffer*, void*, size_t, uint64_t);
    void (*cb_error_bind_refused)(struct netio_listen_socket*);

The callback `cb_msg_received` notifies the receiver that a message of length
`s` located at `data` has been received on buffer `b` of receiving socket `r`.
`cb_msg_imm_received` is the same as `cb_msg_received` except for an extra
8-byte argument used to receive *immediate data*, described below in the
context of send functions. Like all other callbacks, `cb_msg_received` is run
by the event loop thread.

Memory Management
.................

As mentioned in the previous section, MRs used for sending or receiving
messages need to be pinned. NetIO-next distinguishes between memory that is
used for send operations and memory that is used for receive operations. Any
ordinary user space buffer can be registered [#]_. The following two functions
can be used for the memory registration:

.. doxygenfunction:: netio_register_send_buffer
   :no-link:

.. doxygenfunction:: netio_register_recv_buffer
   :no-link:

The explicit use of `netio_register_recv_buffer` is being deprecated in favour
of buffer allocation and registration in `netio_init_listen_socket` via
attributes passed to the function.

Establishing a Connection
.........................

Listen sockets need to be bound to a network interface and put into listening
mode:

.. doxygenfunction:: netio_listen
   :no-link:

Then send sockets can connect:

.. doxygenfunction:: netio_connect
   :no-link:

Send sockets can also disconnect from an established connection:

.. doxygenfunction:: netio_disconnect
   :no-link:

Sending and Receiving Data
..........................

Starting from the receiving side, the receiver first has to post one or more
buffers to the socket, which will be used to receive data:

.. doxygenfunction:: netio_post_recv
   :no-link:

Naturally, the receive buffer needs to have been registered beforehand using
the memory registration function described above. Upon message reception, a
signal from the CC associated with the receiving socket will invoke the
`cb_msg_received` callback.

On the sending side there is more than one option to send a message::

    int netio_send_buffer(struct netio_send_socket* socket, struct netio_buffer* buf);
    int netio_send(struct netio_send_socket* socket, struct netio_buffer* buf, void* addr, size_t size, uint64_t key);
    int netio_send_imm(struct netio_send_socket* socket, struct netio_buffer* buf, void* addr, size_t size, uint64_t key, uint64_t imm);
    int netio_sendv(struct netio_send_socket* socket, struct netio_buffer** buf, struct iovec* iov, size_t count, uint64_t key);
    int netio_sendv_imm(struct netio_send_socket* socket, struct netio_buffer** buf, struct iovec* iov, size_t count, uint64_t key, uint64_t imm);

The simplest function is `netio_send_buffer`. As the name suggests, this
function sends a full buffer to the remote endpoint.

.. doxygenfunction:: netio_send_buffer
   :no-link:

When the buffer has been successfully transmitted to the remote endpoint, the
`cb_send_completed` callback will be called in response to a CO written by the
network stack in the CQ associated with the send socket. This callback has two
parameters, `socket` and `key`. The first parameter refers to the send socket
that issued the send operation. The `key` parameter is used by the user to
identify the individual send operation that completed. In the case of
`netio_send_buffer`, `key` is set to the address of the buffer that was sent.
All other send functions take a `key` parameter that is passed back into the
completion callback; the key can be set freely by the user.
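The sketch below illustrates the sender-side sequence and the round trip of
`key` through `cb_send_completed`. It is hypothetical: the field names of
`struct netio_buffer` (`data`, `size`) and the argument lists of
`netio_register_send_buffer` and `netio_connect` are assumptions to be checked
against the signatures on this page; the `netio_send` call follows the
declaration listed above::

    #include <inttypes.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SLOT_SIZE 4096

    static struct netio_buffer send_buf;   /* memory registered for sending */

    /* Once the connection is up, send one message from the start of the
     * registered buffer; the chosen key (here the offset 0) identifies it. */
    static void on_established(struct netio_send_socket* socket)
    {
        const char msg[] = "hello";
        memcpy(send_buf.data, msg, sizeof(msg));            /* assumed field name */
        netio_send(socket, &send_buf, send_buf.data, sizeof(msg), /*key=*/0);
    }

    /* The completion carries back the key passed to netio_send; the
     * corresponding region of the send buffer may now be reused. */
    static void on_send_completed(struct netio_send_socket* socket, uint64_t key)
    {
        (void)socket;
        printf("send completed, key=%" PRIu64 "\n", key);
    }

    void start_sender(struct netio_send_socket* socket, const char* host,
                      unsigned short port)
    {
        send_buf.data = malloc(SLOT_SIZE);                  /* assumed field names */
        send_buf.size = SLOT_SIZE;
        netio_register_send_buffer(socket, &send_buf, 0);   /* parameter list is an assumption */

        socket->cb_connection_established = on_established;
        socket->cb_send_completed         = on_send_completed;
        netio_connect(socket, host, port);                  /* parameter list is an assumption */
    }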
.. note::
   The specific condition that triggers the creation of a CO on the sender side
   is FI_INJECT_COMPLETE. Other conditions exist and are listed in `libfabric's
   documentation`_.

The next send function is `netio_send`:

.. doxygenfunction:: netio_send
   :no-link:

This function does not transmit a complete buffer, but only a sub-region of the
buffer. The sub-region is identified by the parameters `addr` and `size`.

In addition to the data in the buffer, the user can pass a few bytes of extra
data to the remote endpoint. This is called *immediate data* and is transported
as part of the underlying protocol headers. If the user wants to make use of
immediate data, the function to use is:

.. doxygenfunction:: netio_send_imm
   :no-link:

On the receiving side, messages with immediate data are received by calling the
second receive callback, `cb_msg_imm_received`, which includes the additional
`imm` parameter. Messages without immediate data result in a call of the
`cb_msg_received` callback. If a message with immediate data is received, but
`cb_msg_imm_received` is not specified (NULL), `cb_msg_received` will be called
instead and the immediate data will be dropped.

Both `netio_send` and `netio_send_imm` come in versions that allow the use of a
scatter/gather vector instead of a single message pointer. As such, they
require a vector of buffers as well. Users must take care that every entry in
the IO vector is fully contained within the associated send buffer (i.e. all
entries reside in registered memory regions). The function declarations are
shown below.

.. doxygenfunction:: netio_sendv
   :no-link:

.. doxygenfunction:: netio_sendv_imm
   :no-link:

.. note::
   The scatter/gather vector has a maximum number of entries. This number
   defaults to 4 in libfabric, but it can be changed by setting the environment
   variables FI_VERBS_TX_IOV_LIMIT and FI_VERBS_RX_IOV_LIMIT (to be set to the
   same value on both the sending and the receiving side). The hard limit
   depends on the hardware; for NVIDIA ConnectX-5 it is 30. The corresponding
   hardware limits are called `max_srq_sge` and `max_sge` and can be probed
   with the command `ibv_devinfo -v`. NetIO-next contains a hardcoded limit
   NETIO_MAX_IOV_LEN set to 28.

.. warning::
   The buffers posted by the receiver side need to be large enough to
   accommodate the inbound messages. If this is not the case, on the occurrence
   of the first non-fitting message the sender CQ moves to an error state and
   does not allow further messages to be sent.
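To complement the sender sketches, a hedged receive-side sketch follows. The
callback signatures are the ones listed earlier in this section, while the
argument lists of `netio_init_listen_socket`, `netio_register_recv_buffer`,
`netio_post_recv` and `netio_listen`, and the `struct netio_buffer` field
names, are assumptions to be checked against the generated signatures; the
receive buffer is sized generously so that inbound messages always fit, as
required by the warning above::

    #include <inttypes.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define RECV_BUF_SIZE (64 * 1024)   /* larger than the biggest expected message */

    static struct netio_buffer recv_buf;

    /* When a sender connects, the listen socket creates a receive socket and
     * this callback fires; register and post a receive buffer on it. */
    static void on_connection(struct netio_recv_socket* recv_socket)
    {
        recv_buf.data = malloc(RECV_BUF_SIZE);                 /* assumed field names */
        recv_buf.size = RECV_BUF_SIZE;
        netio_register_recv_buffer(recv_socket, &recv_buf, 0); /* parameter list is an assumption */
        netio_post_recv(recv_socket, &recv_buf);               /* parameter list is an assumption */
    }

    /* Plain message: repost the buffer so further messages can be received. */
    static void on_msg(struct netio_recv_socket* r, struct netio_buffer* b,
                       void* data, size_t size)
    {
        (void)data;
        printf("received %zu bytes\n", size);
        netio_post_recv(r, b);
    }

    /* Message carrying immediate data: the extra 8 bytes arrive in `imm`. */
    static void on_msg_imm(struct netio_recv_socket* r, struct netio_buffer* b,
                           void* data, size_t size, uint64_t imm)
    {
        (void)data;
        printf("received %zu bytes, imm=%" PRIu64 "\n", size, imm);
        netio_post_recv(r, b);
    }

    void start_receiver(struct netio_listen_socket* listen_socket,
                        struct netio_eventloop* evloop /* assumed argument */,
                        const char* host, unsigned short port)
    {
        netio_init_listen_socket(listen_socket, evloop);   /* argument list is an assumption */
        listen_socket->cb_connection_established = on_connection;
        listen_socket->cb_msg_received            = on_msg;
        listen_socket->cb_msg_imm_received        = on_msg_imm;
        netio_listen(listen_socket, host, port);           /* argument list is an assumption */
    }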
Publish/Subscribe Communication
-------------------------------

Publish/Subscribe is a communication pattern in which a publisher sends
messages to a dynamic list of subscribers. Messages are categorized in streams
using numeric message tags called felix identifiers (fid), to which subscribers
can subscribe. A subscriber can subscribe to one or more fids of a publisher.
The publisher maintains an internal subscription table which contains
connections to the various subscribers. Connection management is automatic and
publishers do not need to connect to (or even be aware of) any subscribers.

The publish/subscribe communication pattern in unbuffered mode works as
follows. First a publish socket needs to be initialised on the sender side and
a subscribe socket on the receiving side.

.. doxygenfunction:: netio_unbuffered_publish_socket_init
   :no-link:

.. doxygenfunction:: netio_unbuffered_subscribe_socket_init
   :no-link:

On the publisher side a single send buffer is passed: the intended use case is
to use this large send buffer as a data pool and to send messages over the
network as scatter/gather vectors pointing to one or more locations within it.
On the subscriber side the number and size of the receive buffers need to be
passed: the allocation is done internally.

The publish socket exposes the following callbacks to the application, allowing
it to perform operations upon subscription, connection and successful
transmission events::

    void (*cb_subscribe)(struct netio_unbuffered_publish_socket*, netio_tag_t, void*, size_t);
    void (*cb_unsubscribe)(struct netio_unbuffered_publish_socket*, netio_tag_t, void*, size_t);
    void (*cb_connection_established)(struct netio_unbuffered_publish_socket*);
    void (*cb_connection_closed)(struct netio_unbuffered_publish_socket*);
    void (*cb_msg_published)(struct netio_unbuffered_publish_socket*, uint64_t);

The subscribe socket has analogous callbacks for connection events and a
callback for message reception::

    void (*cb_connection_established)(struct netio_unbuffered_subscribe_socket*);
    void (*cb_connection_closed)(struct netio_unbuffered_subscribe_socket*);
    void (*cb_error_connection_refused)(struct netio_unbuffered_subscribe_socket*);
    void (*cb_msg_received)(struct netio_unbuffered_subscribe_socket*, netio_tag_t, void*, size_t);

Subscriptions and unsubscriptions are performed by the subscriber with fid
granularity using

.. doxygenfunction:: netio_unbuffered_subscribe
   :no-link:

.. doxygenfunction:: netio_unbuffered_unsubscribe
   :no-link:

(Un)subscriptions happen by exchanging messages using the same network protocol
used to exchange data. (Un)subscription messages are sent by the send socket
included in the `netio_unbuffered_subscribe_socket` and each contains a `struct
netio_subscription_message`. (Un)subscription messages are received by a
`netio_recv_socket` spawned by the `netio_listen_socket` included in the
`netio_unbuffered_publish_socket` on connection request. Allocation and
registration of buffers for subscriptions is not exposed.

.. note::
   If a remote client unsubscribes from all fids it was subscribed to, the
   publisher closes the connection. Inside NetIO-next the procedure unfolds as
   follows: the send socket assigned to the remote client and belonging to the
   publish socket sends a shutdown message (FI_SHUTDOWN) that is received by
   the receiving socket of the subscribe socket. The subscribe socket echoes
   the shutdown via the send socket it uses for subscription messages, causing
   the closure of the receiving socket of the publish socket. This exchange of
   shutdown messages ensures that all resources associated with a closed
   connection are freed.
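A hedged subscriber-side sketch is shown below. The callback signature follows
the listing above, while the argument order of
`netio_unbuffered_subscribe_socket_init` (event loop, remote host and port,
number and size of the internally allocated buffers) and of
`netio_unbuffered_subscribe` is an assumption to be verified against the
generated signatures::

    #include <stdio.h>

    /* A received message belonging to a subscribed fid (tag). */
    static void on_subscribed_msg(struct netio_unbuffered_subscribe_socket* s,
                                  netio_tag_t tag, void* data, size_t size)
    {
        (void)s; (void)data;
        printf("fid %lu: %zu bytes\n", (unsigned long)tag, size);
    }

    void start_subscriber(struct netio_unbuffered_subscribe_socket* socket,
                          struct netio_eventloop* evloop /* assumed argument */,
                          const char* host, unsigned short port)
    {
        /* Number and size of the internally allocated receive buffers;
         * the argument order below is an assumption. */
        netio_unbuffered_subscribe_socket_init(socket, evloop, host, port,
                                               /*num_buffers=*/16,
                                               /*buffer_size=*/64 * 1024);
        socket->cb_msg_received = on_subscribed_msg;

        /* Subscribe to one fid; the publisher will start sending matching messages. */
        netio_unbuffered_subscribe(socket, /*fid=*/0x1234);
    }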
Data is published with `netio_unbuffered_publishv_usr`.

.. doxygenfunction:: netio_unbuffered_publishv_usr
   :no-link:

The `usr` field allows the user to insert up to 8 bytes of user data at the
beginning of the message. This is useful to add a small header as part of a
user-defined communication protocol. The user data do *not* need to be part of
a registered netio buffer. In the context of FELIX the `usr` field is used to
carry the `status byte` header that indicates error conditions.

The input/output parameter `key` is used to track the completion of publish
operations. A publish operation can trigger send operations on multiple
connections and can therefore produce multiple completion events in the RDMA
hardware. To keep count of the number of received completions, NetIO-next uses
the 8 bytes indicated by `key`. The `key` is passed to the CO used for the send
operation; in this way, when the send operation completes, a CO containing
`key` is notified. A stack of COs is allocated in
`netio_unbuffered_publish_socket_init` and is common to all `netio_send_socket`
instances belonging to the same `netio_unbuffered_publish_socket` and sending
data to different remote endpoints. COs are re-used: sending a message requires
popping one CO from the stack, while `cb_msg_published` returns a CO to the
stack. The size of the completion stack determines how many outstanding
messages there can be at a given time.

The return value of the unbuffered publish call is important. If
NETIO_STATUS_PARTIAL is returned, data was sent successfully only to a subset
of the subscribed nodes. This means users need to call the publish function
again in the future with the same parameters and the NETIO_REENTRY flag set.
NETIO_STATUS_AGAIN means that there were not enough resources to process the
operation. The user should issue the same publish call again, but in this case
without the NETIO_REENTRY flag. The callback `cb_msg_published` is invoked only
when data was sent to all subscribed remote endpoints. The object that
internally keeps track of successful deliveries and triggers `cb_msg_published`
is the `netio_semaphore`; it is used only in unbuffered mode.

.. note::
   In felix-tohost the `key` field is used to store the address of a message in
   the FELIX DMA buffer. Once the message has been delivered to all subscribers
   the semaphore triggers `cb_msg_published` with the CO carrying the `key`.
   The value of `key` is used to advance the FELIX DMA read pointer, allowing
   the firmware to write to that location again.

------------

.. [#] Virtual memory needs to be backed by `struct page*` in the Linux kernel.
   This is the case for any ordinary memory allocated using `malloc` and
   similar functions. Memory addresses obtained from device drivers that
   perform their own mapping into virtual address space may be problematic.
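Finally, a hedged sketch of a publish call with the retry behaviour described
above. `NETIO_STATUS_PARTIAL`, `NETIO_STATUS_AGAIN` and `NETIO_REENTRY` are the
values introduced in this section, while the exact argument order of
`netio_unbuffered_publishv_usr` is an assumption to be checked against the
generated signature::

    #include <stdint.h>
    #include <sys/uio.h>

    /* Publish one message for a given fid and drive the retry logic described
     * above. The argument order (socket, fid, I/O vector, entry count,
     * completion key, usr word, flags) is an assumption. */
    void publish_with_retry(struct netio_unbuffered_publish_socket* socket,
                            netio_tag_t fid, struct iovec* iov, size_t count,
                            uint64_t* key, uint64_t status_byte)
    {
        int ret = netio_unbuffered_publishv_usr(socket, fid, iov, count,
                                                key, status_byte, 0);
        while (ret == NETIO_STATUS_PARTIAL || ret == NETIO_STATUS_AGAIN) {
            /* PARTIAL: some subscribers are still pending, re-enter the call.
             * AGAIN: no resources were available, repeat without the flag.
             * A real application would typically return to the event loop and
             * retry later instead of spinning here. */
            int flags = (ret == NETIO_STATUS_PARTIAL) ? NETIO_REENTRY : 0;
            ret = netio_unbuffered_publishv_usr(socket, fid, iov, count,
                                                key, status_byte, flags);
        }
        /* cb_msg_published fires once every subscriber has received the message;
         * only then may the buffer regions behind `iov` be reused. */
    }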