Unbuffered RDMA Communication

The main advantage of RDMA network communication is its support for zero-copy network transfers, i.e. data transfers from the user space of one node to the user space of another node without any intermediate memory copies. In other words, on the sending side the network card reads the message directly from its location in host memory, and on the receiving side it writes the incoming message to a predefined memory location.

In the context of NetIO-next, zero-copy communication is referred to as “unbuffered”. Unbuffered communication is convenient for workloads characterised by large messages (>O(kB)). If instead the workload is characterised by a high rate of small (O(byte)) messages, data coalescence is more convenient: messages are copied into larger buffers, which are transferred once full. Data coalescence is supported by NetIO-next and is referred to as buffered communication; see Buffered RDMA Communication.
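The coalescence idea described above can be sketched as follows. This is an illustration only and not the NetIO-next buffered implementation: the type coalesce_buffer, the functions coalesce_append and coalesce_flush and the 4096-byte capacity are hypothetical, and message framing is omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define COALESCE_CAPACITY 4096  /* hypothetical buffer size */

/* Hypothetical coalescing buffer, not a NetIO-next type. */
struct coalesce_buffer {
    uint8_t  data[COALESCE_CAPACITY];
    size_t   used;      /* bytes currently stored */
    unsigned flushes;   /* number of simulated network transfers */
};

/* Stand-in for handing the filled buffer to the network. */
static void coalesce_flush(struct coalesce_buffer *cb)
{
    if (cb->used > 0) {
        cb->flushes++;
        cb->used = 0;
    }
}

/* Copy one small message into the buffer, flushing first if it would
 * not fit. Returns 0 on success, -1 if the message can never fit. */
static int coalesce_append(struct coalesce_buffer *cb, const void *msg, size_t len)
{
    assert(cb != NULL && msg != NULL);
    if (len > COALESCE_CAPACITY)
        return -1;
    if (cb->used + len > COALESCE_CAPACITY)
        coalesce_flush(cb);
    memcpy(cb->data + cb->used, msg, len);
    cb->used += len;
    return 0;
}
```

With 8-byte messages, 512 appends fill the buffer exactly and the 513th triggers a single transfer. That is the trade made by coalescence: one extra memory copy per message in exchange for far fewer network operations.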

For both buffered and unbuffered communication, two messaging patterns are supported: point-to-point and publish-subscribe. The former is intended for communication between two remote endpoints and is typically used to send data to the FELIX PC; the latter allows remote endpoints to receive streams of data on demand, upon subscription.

Point-to-Point Communication

NetIO-next supports unidirectional unbuffered point-to-point communication using three socket types:

  • send sockets (struct netio_send_socket): the sending side of a connection.

  • listen sockets (struct netio_listen_socket): listen for incoming connections and create receive sockets to form connection pairs.

  • receive sockets (struct netio_recv_socket): the receiving side of a connection, created by a listen socket.

Socket Initialization

Receive sockets are created by listen sockets when a connection is opened, so users only need to create and initialize send and listen sockets manually. The following functions are used to initialize them:

void netio_init_send_socket(struct netio_send_socket *socket, struct netio_context *ctx)

Initializes an unbuffered send socket.

Parameters:
  • socket – The socket to initialize

  • ctx – The netio context

void netio_init_listen_socket(struct netio_listen_socket *socket, struct netio_context *ctx, struct netio_unbuffered_socket_attr *attr)

Initializes an unbuffered listen socket.

Parameters:
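  • socket – The socket to initialize

  • ctx – The netio context

  • attr – Attributes of the unbuffered socket, used for receive buffer allocation and registration (see Memory Management)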

Unbuffered sockets support a set of user-definable callbacks for different connection events. For send sockets, the following callbacks are supported:

void (*cb_connection_established)(struct netio_send_socket*);
void (*cb_connection_closed)(struct netio_send_socket*);
void (*cb_internal_connection_closed)(struct netio_send_socket*);
void (*cb_send_completed)(struct netio_send_socket*, uint64_t key);
void (*cb_error_connection_refused)(struct netio_send_socket*);

cb_connection_established and cb_connection_closed update the application about the connection status. cb_internal_connection_closed is an internal callback used to release resources in the appropriate way, depending on whether or not the send socket is associated with a netio_unbuffered_publish_socket. cb_send_completed reports that a send operation completed successfully and that the memory location that stored the message can be reused. In the felix-tohost FELIX readout application this callback is used to report that data written by the FELIX card has been sent and that the FELIX card is free to overwrite it.

For listen sockets, the set of supported callbacks is:

void (*cb_connection_established)(struct netio_recv_socket*);
void (*cb_connection_closed)(struct netio_recv_socket*);
void (*cb_msg_received)(struct netio_recv_socket*, struct netio_buffer*, void*, size_t);
void (*cb_msg_imm_received)(struct netio_recv_socket*, struct netio_buffer*, void*, size_t, uint64_t);
void (*cb_error_bind_refused)(struct netio_listen_socket*);

The callback cb_msg_received notifies the receiver that a message of length s, located at address data, has been received in buffer b of receiving socket r. cb_msg_imm_received is the same as cb_msg_received except for an extra 8-byte argument used to receive immediate data, described below in the context of send functions. Like all other callbacks, cb_msg_received is run by the event loop thread.

Memory Management

As mentioned in the previous section, MRs used for sending or receiving messages need to be pinned. NetIO-next distinguishes between memory that is used for send operations and memory that is used for receive operations. Any ordinary user space buffer can be registered [1]. The following two functions can be used for memory registration:

void netio_register_send_buffer(struct netio_send_socket *socket, struct netio_buffer *buf, uint64_t flags)
void netio_register_recv_buffer(struct netio_recv_socket *socket, struct netio_buffer *buf, uint64_t flags)

The explicit use of netio_register_recv_buffer is being deprecated in favour of buffer allocation and registration inside netio_init_listen_socket, driven by the attributes passed to that function.

Establishing a Connection

Listen sockets need to be bound to a network interface and put into listening mode:

void netio_listen(struct netio_listen_socket *socket, const char *hostname, unsigned port)

Bind an unbuffered listen socket to an endpoint and listen for incoming connections.

Parameters:
  • socket – An unbuffered listen socket

  • hostname – Hostname or IP address of an endpoint

  • port – A port number to listen on

Then send sockets can connect:

void netio_connect(struct netio_send_socket *socket, const char *hostname, unsigned port)

Send sockets can also disconnect from an established connection:

void netio_disconnect(struct netio_send_socket *socket)

Disconnect a connected unbuffered send socket.

Parameters:
  • socket – A connected unbuffered send socket
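The call order described in this section can be sketched as below. To keep the example compilable without the NetIO-next headers, every type and function here is a local stand-in: the struct layouts are invented, and the netio_* functions are stubs that share the documented signatures but only record state. The loopback address, the port number and the helpers setup_receiver and setup_sender are hypothetical. In a real application the two sides normally live in different processes, and connection events are delivered asynchronously through the callbacks listed earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in types: just enough state to make the call order visible. */
struct netio_context { int placeholder; };
struct netio_unbuffered_socket_attr { int placeholder; };
struct netio_listen_socket { int listening; unsigned port; };
struct netio_send_socket { int connected; unsigned port; };

/* Stubs with the documented signatures. They only record state. */
static void netio_init_listen_socket(struct netio_listen_socket *socket,
        struct netio_context *ctx, struct netio_unbuffered_socket_attr *attr)
{
    (void)ctx; (void)attr;
    socket->listening = 0;
    socket->port = 0;
}

static void netio_init_send_socket(struct netio_send_socket *socket,
        struct netio_context *ctx)
{
    (void)ctx;
    socket->connected = 0;
    socket->port = 0;
}

static void netio_listen(struct netio_listen_socket *socket,
        const char *hostname, unsigned port)
{
    (void)hostname;
    socket->listening = 1;
    socket->port = port;
}

static void netio_connect(struct netio_send_socket *socket,
        const char *hostname, unsigned port)
{
    (void)hostname;
    socket->connected = 1;
    socket->port = port;
}

static void netio_disconnect(struct netio_send_socket *socket)
{
    assert(socket->connected);
    socket->connected = 0;
}

/* Receiver side: initialize the listen socket, then bind and listen. */
static void setup_receiver(struct netio_context *ctx,
        struct netio_listen_socket *ls, struct netio_unbuffered_socket_attr *attr)
{
    netio_init_listen_socket(ls, ctx, attr);
    netio_listen(ls, "127.0.0.1", 12345);
}

/* Sender side: initialize the send socket, then connect to the listener. */
static void setup_sender(struct netio_context *ctx, struct netio_send_socket *ss)
{
    netio_init_send_socket(ss, ctx);
    netio_connect(ss, "127.0.0.1", 12345);
}
```

The receiver has to be listening before the sender connects. A sender that connects too early would presumably be notified through cb_error_connection_refused, although that callback's exact semantics are not described here.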

Sending and Receiving Data

Starting from the receiving side, first the receiver has to post one or more buffers to the socket which will be used to receive data:

void netio_post_recv(struct netio_recv_socket *socket, struct netio_buffer *buf)

Post a receive buffer to an unbuffered receive socket.

Receive buffers must be registered using netio_register_recv_buffer.

Parameters:
  • socket – An unbuffered receive socket

  • buf – A registered receive buffer.

Naturally, the receive buffer must already have been registered using the memory registration function described above. Upon message reception, a signal from the CC associated with the receiving socket will invoke the cb_msg_received callback.

On the sending side there is more than one option to send a message:

int netio_send_buffer(struct netio_send_socket* socket, struct netio_buffer* buf);
int netio_send(struct netio_send_socket* socket, struct netio_buffer* buf, void* addr, size_t size, uint64_t key);
int netio_send_imm(struct netio_send_socket* socket, struct netio_buffer* buf, void* addr, size_t size, uint64_t key, uint64_t imm);
int netio_sendv(struct netio_send_socket* socket, struct netio_buffer** buf, struct iovec* iov, size_t count, uint64_t key);
int netio_sendv_imm(struct netio_send_socket* socket, struct netio_buffer** buf, struct iovec* iov, size_t count, uint64_t key, uint64_t imm);

The simplest function is netio_send_buffer. As the name suggests, this function sends a full buffer to the remote endpoint.

int netio_send_buffer(struct netio_send_socket *socket, struct netio_buffer *buf)

Sends a full buffer over a connected unbuffered send socket.

Parameters:
  • socket – A connected, unbuffered send socket

  • buf – A registered send buffer

When the buffer has been successfully transmitted to the remote endpoint, the cb_send_completed callback is called in response to a CO written by the network stack in the CQ associated with the send socket. This callback has two parameters, socket and key. The first parameter refers to the send socket that issued the send operation. The key parameter is used by the user to identify the individual send operation that completed. In the case of netio_send_buffer, key is set to the address of the buffer that was sent. All other send operations include a key parameter that is passed back to the completion callback. The key can be set freely by the user.
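A minimal sketch of how an application might use the key in a completion callback is shown below. It assumes the netio_send_buffer case, where the key is the address of the buffer that was sent. The netio_buffer layout, the send_slot wrapper and the helpers mark_sent and key_for_buffer are hypothetical; only the signature of on_send_completed follows cb_send_completed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct netio_send_socket;                           /* opaque in this sketch */
struct netio_buffer { void *data; size_t size; };   /* stand-in layout */

/* Hypothetical application bookkeeping around one send buffer. */
struct send_slot {
    struct netio_buffer buf;   /* must stay the first member */
    bool in_flight;            /* true between send and completion */
};

/* Record that the slot was handed to a send call. */
static void mark_sent(struct send_slot *slot)
{
    slot->in_flight = true;
}

/* The key as netio_send_buffer is described to report it:
 * the address of the buffer that was sent. */
static uint64_t key_for_buffer(const struct netio_buffer *buf)
{
    return (uint64_t)(uintptr_t)buf;
}

/* Same signature as cb_send_completed. The buffer is the first member
 * of send_slot, so its address is also the address of the slot. */
static void on_send_completed(struct netio_send_socket *socket, uint64_t key)
{
    (void)socket;
    struct send_slot *slot = (struct send_slot *)(uintptr_t)key;
    assert(slot->in_flight);
    slot->in_flight = false;   /* the memory may now be reused */
}
```

For the other send functions the key is chosen by the caller, so the same pattern works with any value that lets the application find its own bookkeeping, for example an index into a table of outstanding messages.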

Note

The specific condition that triggers the creation of a CO on the sender side is FI_INJECT_COMPLETE. Other conditions exist and are listed in libfabric’s documentation.

The next send function is netio_send:

int netio_send(struct netio_send_socket *socket, struct netio_buffer *buf, void *addr, size_t size, uint64_t key)

Sends a partial buffer over a connected unbuffered send socket.

Parameters:
  • socket – A connected, unbuffered send socket

  • buf – A registered send buffer

  • addr – Pointer to message within the buffer

  • size – Size of the message

  • key – Message key used to track the message progress.

This function does not transmit a complete buffer, but only a sub-region of it. The sub-region is identified by the parameters addr and size. In addition to the data in the buffer, the user can pass a few bytes of extra data to the remote endpoint. This is called immediate data and is transported as part of the underlying protocol headers. To make use of immediate data, the function to use is

int netio_send_imm(struct netio_send_socket *socket, struct netio_buffer *buf, void *addr, size_t size, uint64_t key, uint64_t imm)

Sends a partial buffer with immediate data over a connected unbuffered send socket.

Parameters:
  • socket – A connected, unbuffered send socket

  • buf – A registered send buffer

  • addr – Pointer to message within the buffer

  • size – Size of the message

  • key – Message key used to track the message progress.

  • imm – Immediate data, up to 8 bytes (size is implementation-dependent)

On the receiving side, messages with immediate data are delivered through the second receive callback, cb_msg_imm_received, which includes the additional imm parameter. Messages without immediate data result in a call to cb_msg_received. If a message with immediate data is received but cb_msg_imm_received is not set (NULL), cb_msg_received is called instead and the immediate data is dropped.
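The callback selection rule described above can be written down as a small dispatch function. This is a sketch of the rule and not the NetIO-next source: the recv_callbacks struct, the dispatch_result enum, dispatch_message and the two counting callbacks are hypothetical, while the callback signatures follow the listen socket callbacks listed earlier.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct netio_recv_socket;                           /* opaque in this sketch */
struct netio_buffer { void *data; size_t size; };   /* stand-in layout */

/* Hypothetical holder for the two message callbacks. */
struct recv_callbacks {
    void (*cb_msg_received)(struct netio_recv_socket *, struct netio_buffer *,
                            void *, size_t);
    void (*cb_msg_imm_received)(struct netio_recv_socket *, struct netio_buffer *,
                                void *, size_t, uint64_t);
};

enum dispatch_result {
    DISPATCH_NONE,              /* no callback set */
    DISPATCH_PLAIN,             /* cb_msg_received, no immediate data */
    DISPATCH_IMM,               /* cb_msg_imm_received */
    DISPATCH_PLAIN_IMM_DROPPED  /* cb_msg_received, immediate data dropped */
};

static enum dispatch_result dispatch_message(const struct recv_callbacks *cb,
        struct netio_recv_socket *sock, struct netio_buffer *buf,
        void *data, size_t size, int has_imm, uint64_t imm)
{
    assert(cb != NULL);
    if (has_imm && cb->cb_msg_imm_received != NULL) {
        cb->cb_msg_imm_received(sock, buf, data, size, imm);
        return DISPATCH_IMM;
    }
    if (cb->cb_msg_received != NULL) {
        cb->cb_msg_received(sock, buf, data, size);
        return has_imm ? DISPATCH_PLAIN_IMM_DROPPED : DISPATCH_PLAIN;
    }
    return DISPATCH_NONE;
}

/* Example callbacks that only count invocations. */
static int plain_calls, imm_calls;
static uint64_t last_imm;

static void on_msg(struct netio_recv_socket *s, struct netio_buffer *b,
                   void *data, size_t size)
{
    (void)s; (void)b; (void)data; (void)size;
    plain_calls++;
}

static void on_msg_imm(struct netio_recv_socket *s, struct netio_buffer *b,
                       void *data, size_t size, uint64_t imm)
{
    (void)s; (void)b; (void)data; (void)size;
    imm_calls++;
    last_imm = imm;
}
```

An application that relies on immediate data should therefore always set cb_msg_imm_received, since the fallback to cb_msg_received loses the immediate value.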

Both netio_send and netio_send_imm come in versions that accept a scatter/gather vector instead of a single message pointer. As such, they require a vector of buffers as well. Users must take care that every entry in the IO vector is fully contained within the associated send buffer (i.e. all entries reside in registered memory regions). The function declarations are shown below.

int netio_sendv(struct netio_send_socket *socket, struct netio_buffer **buf, struct iovec *iov, size_t count, uint64_t key)

Sends partial buffer data, described by a scatter/gather vector, over a connected unbuffered send socket.

Parameters:
  • socket – A connected, unbuffered send socket

  • buf – A vector of registered send buffers

  • iov – Scatter/gather buffer describing message within the buffers

  • count – Size of the scatter/gather vector

  • key – Message key used to track the message progress.

int netio_sendv_imm(struct netio_send_socket *socket, struct netio_buffer **buf, struct iovec *iov, size_t count, uint64_t key, uint64_t imm)

Sends a partial buffer with immediate data over a connected unbuffered send socket.

Parameters:
  • socket – A connected, unbuffered send socket

  • buf – A vector of registered send buffers

  • iov – Scatter/gather buffer describing message within the buffers

  • count – Size of the scatter/gather vector

  • key – Message key used to track the message progress.

  • imm – Immediate data, up to 8 bytes (size is implementation-dependent)

Note

The scatter/gather vector has a maximum number of entries. libfabric sets this number to 4 by default, but it can be changed by setting the environment variables FI_VERBS_TX_IOV_LIMIT and FI_VERBS_RX_IOV_LIMIT (which must have the same values on the sending and receiving sides). The hard limit depends on the hardware; for Nvidia ConnectX-5 it is 30. The corresponding hardware limits are called max_srq_sge and max_sge and can be probed with the command ibv_devinfo -v. NetIO-next contains a hardcoded limit, NETIO_MAX_IOV_LEN, set to 28.
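Before calling netio_sendv or netio_sendv_imm, an application may want to check the two constraints stated above: every IO vector entry lies inside its associated buffer, and the entry count does not exceed NETIO_MAX_IOV_LEN. The helper below is a sketch of such a check. The function iov_is_valid and the data and size fields of netio_buffer are assumptions; the limit of 28 is the value quoted in the note.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>   /* struct iovec */

#define NETIO_MAX_IOV_LEN 28   /* value quoted in the note above */

/* Stand-in for the real struct; the field names are assumptions. */
struct netio_buffer { void *data; size_t size; };

/* Return 1 if count is within NETIO_MAX_IOV_LEN and iov[i] lies
 * entirely inside buf[i] for every i, 0 otherwise. */
static int iov_is_valid(struct netio_buffer **buf, const struct iovec *iov,
                        size_t count)
{
    assert(count == 0 || (buf != NULL && iov != NULL));
    if (count > NETIO_MAX_IOV_LEN)
        return 0;
    for (size_t i = 0; i < count; i++) {
        uintptr_t start = (uintptr_t)buf[i]->data;
        uintptr_t base  = (uintptr_t)iov[i].iov_base;
        if (base < start)
            return 0;                       /* begins before the buffer */
        size_t offset = (size_t)(base - start);
        if (offset > buf[i]->size)
            return 0;                       /* begins after the buffer */
        if (iov[i].iov_len > buf[i]->size - offset)
            return 0;                       /* runs past the buffer end */
    }
    return 1;
}
```

The length comparison is written as iov_len > size - offset rather than offset + iov_len > size so that it cannot overflow for very large lengths.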

Warning

The buffers posted by the receiving side need to be large enough to accommodate the inbound messages. If this is not the case, the first message that does not fit puts the sender CQ into an error state, and no further messages can be sent.

Publish/Subscribe Communication

Publish/Subscribe is a communication pattern in which a publisher sends messages to a dynamic list of subscribers. Messages are categorized into streams using numeric message tags called FELIX identifiers (fids), to which subscribers can subscribe. A subscriber can subscribe to one or more fids of a publisher.

The publisher maintains an internal subscription table which contains connections to the various subscribers. Connection management is automatic and publishers do not need to connect to (or even be aware of) any subscribers.

The publish/subscribe communication pattern in unbuffered mode works as follows. First, a publish socket needs to be initialised on the sending side and a subscribe socket on the receiving side.

void netio_unbuffered_publish_socket_init(struct netio_unbuffered_publish_socket *socket, struct netio_context *ctx, const char *hostname, unsigned port, struct netio_buffer *buf)

Initialize an unbuffered publish socket

Parameters:
  • socket – An unbuffered publish socket

  • ctx – A netio context

  • hostname – Local hostname to bind to

  • port – Local port to bind to

  • buf – A registered send buffer

void netio_unbuffered_subscribe_socket_init(struct netio_unbuffered_subscribe_socket *socket, struct netio_context *ctx, const char *hostname, const char *remote_host, unsigned remote_port, size_t buffer_size, size_t count)

Initialize an unbuffered subscribe socket

Parameters:
  • socket – An unbuffered subscribe socket

  • ctx – A netio context

  • hostname – A local hostname or IP to bind to

  • remote_host – Hostname or IP of the remote publish socket

  • remote_port – Port of the remote publish socket

  • buffer_size – Size of each receive buffer

  • count – Number of receive buffers

On the publisher side a single send buffer is passed: the intended use case is to treat one large send buffer as a data pool and to send messages over the network as scatter/gather vectors pointing to one or more locations inside it. On the subscriber side the number and size of the receive buffers are passed, and the allocation is done internally.

The publish socket exposes the following callbacks to the application. They allow operations to be performed upon subscription, connection and successful transmission events:

void (*cb_subscribe)(struct netio_unbuffered_publish_socket*, netio_tag_t, void*, size_t);
void (*cb_unsubscribe)(struct netio_unbuffered_publish_socket*, netio_tag_t, void*, size_t);
void (*cb_connection_established)(struct netio_unbuffered_publish_socket*);
void (*cb_connection_closed)(struct netio_unbuffered_publish_socket*);
void (*cb_msg_published)(struct netio_unbuffered_publish_socket*, uint64_t);

The subscribe socket has analogous callbacks for connection events and a callback for message reception.

void (*cb_connection_established)(struct netio_unbuffered_subscribe_socket*);
void (*cb_connection_closed)(struct netio_unbuffered_subscribe_socket*);
void (*cb_error_connection_refused)(struct netio_unbuffered_subscribe_socket*);
void (*cb_msg_received)(struct netio_unbuffered_subscribe_socket*, netio_tag_t, void*, size_t);

Subscriptions and unsubscriptions are performed by the subscriber with fid granularity using

int netio_unbuffered_subscribe(struct netio_unbuffered_subscribe_socket *socket, netio_tag_t tag)

Subscribe an unbuffered subscribe socket to a given tag.

Parameters:
  • socket – An unbuffered subscribe socket

  • tag – A netio tag

int netio_unbuffered_unsubscribe(struct netio_unbuffered_subscribe_socket *socket, netio_tag_t tag)

Unsubscribe from a given message tag.

For a given subscribe socket, netio_unbuffered_unsubscribe can be called multiple times.

Parameters:
  • socket – The unbuffered subscribe socket.

  • tag – The tag to unsubscribe from.

(Un)subscriptions happen by exchanging messages over the same network protocol used to exchange data. (Un)subscription messages are sent by the send socket included in the netio_unbuffered_subscribe_socket, and each contains a struct netio_subscription_message. (Un)subscription messages are received by a netio_recv_socket spawned, on connection request, by the netio_listen_socket included in the netio_unbuffered_publish_socket. Allocation and registration of buffers for subscriptions is not exposed.

Note

If a remote client unsubscribes from all fids it was subscribed to, the publisher closes the connection. Inside NetIO-next the procedure unfolds as follows: the send socket assigned to the remote client and belonging to the publish socket sends a shutdown message (FI_SHUTDOWN) that is received by the receiving socket of the subscribe socket. The subscribe socket echoes the shutdown via the send socket it uses for subscription messages, causing the closure of the receiving socket of the publish socket. This exchange of shutdown messages ensures that all resources associated with a closed connection are freed.
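The bookkeeping behind this rule can be illustrated with a toy subscription table that reports when a connection has no fids left. It is a sketch only and does not reflect the internal NetIO-next table: sub_table, its fixed capacity, the integer connection ids and the three functions are hypothetical, and netio_tag_t is assumed to be a 64-bit integer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t netio_tag_t;   /* assumption: tags (fids) are 64-bit integers */

#define MAX_SUBS 16             /* hypothetical table capacity */

struct sub_entry { netio_tag_t tag; int conn; int used; };
struct sub_table { struct sub_entry e[MAX_SUBS]; };

/* Number of tags the given connection is subscribed to. */
static int table_count(const struct sub_table *t, int conn)
{
    int n = 0;
    for (size_t i = 0; i < MAX_SUBS; i++)
        if (t->e[i].used && t->e[i].conn == conn)
            n++;
    return n;
}

/* Add (tag, conn). Returns 0 on success or if already present,
 * -1 if the table is full. */
static int table_subscribe(struct sub_table *t, netio_tag_t tag, int conn)
{
    assert(t != NULL);
    size_t free_slot = MAX_SUBS;
    for (size_t i = 0; i < MAX_SUBS; i++) {
        struct sub_entry *e = &t->e[i];
        if (e->used && e->tag == tag && e->conn == conn)
            return 0;
        if (!e->used && free_slot == MAX_SUBS)
            free_slot = i;
    }
    if (free_slot == MAX_SUBS)
        return -1;
    t->e[free_slot].tag = tag;
    t->e[free_slot].conn = conn;
    t->e[free_slot].used = 1;
    return 0;
}

/* Remove (tag, conn). Returns 1 if the connection has no subscriptions
 * left, which is when the publisher would start the shutdown exchange. */
static int table_unsubscribe(struct sub_table *t, netio_tag_t tag, int conn)
{
    assert(t != NULL);
    for (size_t i = 0; i < MAX_SUBS; i++) {
        struct sub_entry *e = &t->e[i];
        if (e->used && e->tag == tag && e->conn == conn)
            e->used = 0;
    }
    return table_count(t, conn) == 0;
}
```

A client subscribed to two fids keeps its connection after the first unsubscription and loses it after the second.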

Data is published with netio_unbuffered_publishv_usr.

int netio_unbuffered_publishv_usr(struct netio_unbuffered_publish_socket *socket, netio_tag_t tag, struct iovec *iov, size_t count, uint64_t *key, int flags, struct netio_subscription_cache *cache, uint64_t usr, uint8_t usr_size)

Publishes a message on an unbuffered publish socket.

The message is given as a scatter/gather buffer (struct iovec). The caller has to ensure the validity of the buffer until the transfer is complete. A transfer is complete when the socket’s cb_msg_published callback has been called. A key can be passed to the call to identify the publication. The key is passed back in the cb_msg_published callback.

The cb_msg_published callback is only called once the message has been sent successfully to all subscribed endpoints.

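The call may not reach all subscribers in a single attempt. If it returns NETIO_STATUS_PARTIAL, it is the user’s responsibility to call netio_unbuffered_publishv again with the NETIO_REENTRY flag. If it returns NETIO_STATUS_AGAIN, no data were sent and the call should be repeated with the same parameters (see the return values below).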

Parameters:
  • socket – The socket to publish on

  • tag – The tag under which to publish

  • iov – Message data iov

  • count – IOV count

  • key – Key that will be passed to the callback on successful publish of the message. This is an input-output parameter. In case the function returns NETIO_STATUS_PARTIAL, ‘key’ is used as storage to track the completion data for the given tag. If netio_unbuffered_publishv is called again with the NETIO_REENTRY flag, ‘key’ must remain unchanged. In other words, for a given tag, ‘key’ is only set by the user before the initial call to netio_unbuffered_publishv without the NETIO_REENTRY flag.

  • flags – Set NETIO_REENTRY if publishing of this message was attempted before and resulted in NETIO_STATUS_PARTIAL. Calling publish with this flag only sends on connections on which the message has not yet been published.

  • cache – Optional user-supplied cache for the subscription table lookup.

  • usr – Up to 8 bytes of data that are transmitted at the beginning of the message. This allows the user to add a short header to a message without having to allocate buffer space for it.

  • usr_size – Size of the usr header field. Set to 0 if no header is required. The maximum header size is 8.

Returns:
  • NETIO_STATUS_OK – The message was published successfully to all subscribed endpoints.

  • NETIO_STATUS_OK_NOSUB – There are no ongoing subscriptions for the given message.

  • NETIO_STATUS_AGAIN – Not enough resources are available to proceed with the operation. No data were sent to any endpoint. The user should try again with the exact same parameters.

  • NETIO_STATUS_PARTIAL – The message was sent to some of the subscribed endpoints, but not all. The user should try again, and additionally set the NETIO_REENTRY flag. Users must take care not to overwrite the key parameter, which is used by the function call to track the operation status.

  • NETIO_ERROR_MAX_IOV_EXCEEDED – Too many iovec entries; try again with fewer.

The usr field allows the user to insert up to 8 bytes of user data at the beginning of the message. This is useful to add a small header as part of a user-defined communication protocol. The user data do not need to be part of a registered netio buffer. In the context of FELIX the usr field is used to carry the status byte header that indicates error conditions.
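On the receiving side, a message published with a usr header begins with usr_size header bytes followed by the payload. The sketch below separates the two. It assumes that publisher and subscriber agree on usr_size as part of their own protocol; the split_msg type and the split_usr_header function are hypothetical, and the header is treated as raw bytes because the byte order of usr on the wire is not specified here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical view of a received message: header bytes, then payload. */
struct split_msg {
    const uint8_t *header;
    size_t         header_size;
    const uint8_t *payload;
    size_t         payload_size;
};

/* Split a received message into its usr header and payload.
 * Returns 0 on success, -1 if usr_size exceeds 8 bytes or the message. */
static int split_usr_header(const void *msg, size_t size, uint8_t usr_size,
                            struct split_msg *out)
{
    assert(msg != NULL && out != NULL);
    if (usr_size > 8 || usr_size > size)
        return -1;
    out->header       = (const uint8_t *)msg;
    out->header_size  = usr_size;
    out->payload      = (const uint8_t *)msg + usr_size;
    out->payload_size = size - usr_size;
    return 0;
}
```

With a one-byte header, as in the FELIX status byte case mentioned above, header[0] carries the status and the remaining bytes are the data.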

The input/output parameter key is used to track the completion of publish operations. A publish operation can trigger send operations on multiple connections and can therefore produce multiple completion events in the RDMA hardware. To keep count of the number of received completions, NetIO-next uses the 8 bytes pointed to by key. The key is passed to the CO used for the send operation, so that when the send operation completes, a CO containing key is reported. A stack of COs is allocated in netio_unbuffered_publish_socket_init and is shared by all netio_send_socket instances that belong to the same netio_unbuffered_publish_socket and send data to different remote endpoints. COs are reused: sending a message pops one CO from the stack, while cb_msg_published returns a CO to the stack. The size of the completion stack determines how many messages can be outstanding at a given time.
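The counting behaviour described above can be modelled with a small counter that fires a callback once the last completion arrives. This is a simplified model and not the layout of netio_semaphore or of the CO stack: completion_sem, sem_arm, sem_complete_one and the single-argument callback are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of per-message completion tracking. */
struct completion_sem {
    unsigned pending;                        /* sends still in flight */
    uint64_t key;                            /* user key to report */
    void   (*cb_msg_published)(uint64_t);    /* simplified signature */
};

/* Arm the counter for one publish that fans out to `sends` connections. */
static void sem_arm(struct completion_sem *s, unsigned sends, uint64_t key,
                    void (*cb)(uint64_t))
{
    s->pending = sends;
    s->key = key;
    s->cb_msg_published = cb;
}

/* Account for one send completion. Returns 1 if it was the last one,
 * in which case the callback has been invoked with the user key. */
static int sem_complete_one(struct completion_sem *s)
{
    assert(s != NULL && s->pending > 0);
    s->pending--;
    if (s->pending > 0)
        return 0;
    if (s->cb_msg_published != NULL)
        s->cb_msg_published(s->key);
    return 1;
}

/* Example callback that records what it was given. */
static uint64_t published_key;
static int published_calls;

static void on_published(uint64_t key)
{
    published_key = key;
    published_calls++;
}
```

For a message published to three subscribers, the callback runs once, after the third completion, and receives the key supplied by the user.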

The return value of the unbuffered publish call is important. If NETIO_STATUS_PARTIAL is returned, data was sent successfully only to a subset of the subscribed nodes. This means users need to call the publish function again later with the same parameters and with the NETIO_REENTRY flag set. NETIO_STATUS_AGAIN means that there were not enough resources to process the operation. The user should issue the same publish call again, but in this case without adding the NETIO_REENTRY flag. The callback cb_msg_published is invoked only when data has been sent to all subscribed remote endpoints. The object that internally keeps track of successful deliveries and triggers cb_msg_published is netio_semaphore; it is used only in unbuffered mode.
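The retry rules above can be summarised in a small loop. The sketch assumes placeholder numeric values for the status codes and for NETIO_REENTRY (the real values come from the NetIO-next headers) and wraps the publish call behind a hypothetical function pointer type, publish_fn, so that it can run against a scripted stand-in. The names publish_with_retry, script and scripted_publish are also hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholder values: the real constants come from the NetIO-next headers. */
enum {
    NETIO_STATUS_OK       = 0,
    NETIO_STATUS_AGAIN    = 1,
    NETIO_STATUS_PARTIAL  = 2,
    NETIO_STATUS_OK_NOSUB = 3
};
#define NETIO_REENTRY 1

/* Hypothetical wrapper around one publish call for a fixed message. */
typedef int (*publish_fn)(uint64_t *key, int flags, void *user);

/* Retry until the publish is no longer AGAIN or PARTIAL, or until
 * max_attempts is reached. The key storage is never modified here. */
static int publish_with_retry(publish_fn publish, uint64_t *key, void *user,
                              unsigned max_attempts)
{
    assert(publish != NULL && key != NULL);
    int flags = 0;
    int status = NETIO_STATUS_AGAIN;
    for (unsigned attempt = 0; attempt < max_attempts; attempt++) {
        status = publish(key, flags, user);
        if (status == NETIO_STATUS_PARTIAL) {
            flags = NETIO_REENTRY;   /* some endpoints served: re-enter */
            continue;
        }
        if (status == NETIO_STATUS_AGAIN)
            continue;                /* nothing sent: same parameters */
        break;                       /* OK, OK_NOSUB or an error */
    }
    return status;
}

/* Scripted stand-in for the publish call, used to exercise the loop. */
struct script { const int *statuses; size_t len; size_t pos; int last_flags; };

static int scripted_publish(uint64_t *key, int flags, void *user)
{
    struct script *s = user;
    (void)key;
    s->last_flags = flags;
    return s->pos < s->len ? s->statuses[s->pos++] : NETIO_STATUS_OK;
}
```

In an event-loop application the retry would normally be rescheduled, for example from a timer or a later loop iteration, rather than spun in a tight loop as done here for brevity.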

Note

In felix-tohost the key field is used to store the address of a message in the FELIX DMA buffer. Once the message has been delivered to all subscribers, the semaphore triggers cb_msg_published, which receives the key carried by the CO. The value of key is used to advance the FELIX DMA read pointer, allowing the firmware to overwrite that location.