.. _UnbufferedCommunication:

Unbuffered RDMA Communication
=============================

The main advantage of RDMA network communication is the support for zero-copy
network transfers, i.e. data transfers from the user space of one node to the
user space of another node without any memory copies. In other words, on the
sender side the network card reads the message from a host memory location,
while on the receiving side it writes the incoming message to a predefined
memory location. In the context of NetIO-next, zero-copy communication is
referred to as "unbuffered".

Unbuffered communication is convenient for workloads characterised by large
messages (>O(kB)). If instead the workload is characterised by a high rate of
small (O(byte)) messages, data coalescence is more convenient: messages are
copied into larger buffers which are transferred once full. Data coalescence is
supported by NetIO-next and is referred to as buffered communication, see
:ref:`BufferedCommunication`.

For both buffered and unbuffered communication, two messaging patterns are
supported: point-to-point and publish-subscribe. The former is intended for
communication between two remote endpoints and is typically used to send data
to the FELIX PC; the latter allows remote endpoints to receive streams of data
on demand, upon subscription.

Point-to-Point Communication
----------------------------

NetIO-next supports unidirectional unbuffered point-to-point communication
using the following socket types:

- *send sockets* (`struct netio_send_socket`): the sending side of a connection.
- *listen sockets* (`struct netio_listen_socket`): listen for incoming
  connections and create receive sockets to form connection pairs.
- *receive sockets* (`struct netio_recv_socket`): the receiving side of a
  connection, created by a listen socket.

Socket Initialization
.....................

Receive sockets are created by listen sockets when a connection is opened, so
users need to manually create and initialize only send and listen sockets. For
the initialization of these, the following functions are used:

.. doxygenfunction:: netio_init_send_socket
   :no-link:

.. doxygenfunction:: netio_init_listen_socket
   :no-link:

Unbuffered sockets support a set of user-definable callbacks for different
connection events. For send sockets, the following callbacks are supported::

    void (*cb_connection_established)(struct netio_send_socket*);
    void (*cb_connection_closed)(struct netio_send_socket*);
    void (*cb_internal_connection_closed)(struct netio_send_socket*);
    void (*cb_send_completed)(struct netio_send_socket*, uint64_t key);
    void (*cb_error_connection_refused)(struct netio_send_socket*);

`cb_connection_established` and `cb_connection_closed` update the application
about the connection status. `cb_internal_connection_closed` is a callback used
internally to clear resources in the appropriate way, depending on whether the
`send_socket` is associated with a `netio_unbuffered_publish_socket` or not.
`cb_send_completed` reports that a send operation completed successfully and
that the memory location that stored the message can be reused. In the
felix-tohost FELIX readout application this callback is used to report that
data written by the FELIX card has been sent and that the FELIX card is free to
overwrite it.
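For illustration, a minimal, hypothetical sketch of wiring these callbacks is
shown below. It assumes that the callbacks are plain function-pointer members
of the socket structure and that `netio_init_send_socket` takes the socket and
an already-initialized event loop; the authoritative argument lists are the
generated signatures above, not this sketch::

    #include <inttypes.h>
    #include <stdio.h>
    #include "netio.h"   /* NetIO-next API header; exact name/path may differ */

    static void on_established(struct netio_send_socket* s)
    {
        (void)s;
        printf("connection established\n");
    }

    static void on_send_completed(struct netio_send_socket* s, uint64_t key)
    {
        (void)s;
        /* The memory region identified by `key` can be reused from here on. */
        printf("send completed, key=%" PRIu64 "\n", key);
    }

    void setup_send_socket(struct netio_send_socket* socket,
                           struct netio_eventloop* evloop /* assumed argument */)
    {
        netio_init_send_socket(socket, evloop);   /* argument list is an assumption */
        socket->cb_connection_established = on_established;
        socket->cb_send_completed         = on_send_completed;
    }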
For listen sockets, the set of supported callbacks is::

    void (*cb_connection_established)(struct netio_recv_socket*);
    void (*cb_connection_closed)(struct netio_recv_socket*);
    void (*cb_msg_received)(struct netio_recv_socket*, struct netio_buffer*, void*, size_t);
    void (*cb_msg_imm_received)(struct netio_recv_socket*, struct netio_buffer*, void*, size_t, uint64_t);
    void (*cb_error_bind_refused)(struct netio_listen_socket*);

The callback `cb_msg_received` notifies the receiver that a message of length
`s` located at `data` has been received on buffer `b` of receiving socket `r`.
`cb_msg_imm_received` is the same as `cb_msg_received` except for an extra
8-byte argument used to receive *immediate data*, described below in the
context of send functions. Like all other callbacks, `cb_msg_received` is run
by the event loop thread.

Memory Management
.................

As mentioned in the previous section, MRs used for sending or receiving
messages need to be pinned. NetIO-next distinguishes between memory that is
used for send operations and memory that is used for receive operations. Any
ordinary user space buffer can be registered [#]_. The following two functions
can be used for the memory registration:

.. doxygenfunction:: netio_register_send_buffer
   :no-link:

.. doxygenfunction:: netio_register_recv_buffer
   :no-link:

The explicit use of `netio_register_recv_buffer` is being deprecated in favour
of buffer allocation and registration in `netio_init_listen_socket` via
attributes passed to the function.

Establishing a Connection
.........................

Listen sockets need to be bound to a network interface and put into listening
mode:

.. doxygenfunction:: netio_listen
   :no-link:

Then send sockets can connect:

.. doxygenfunction:: netio_connect
   :no-link:

Send sockets can also disconnect from an established connection:

.. doxygenfunction:: netio_disconnect
   :no-link:

Sending and Receiving Data
..........................

Starting from the receiving side, the receiver first has to post one or more
buffers to the socket, which will be used to receive data:

.. doxygenfunction:: netio_post_recv
   :no-link:

Naturally, the receive buffer needs to have been registered beforehand using
the memory registration function described above. Upon message reception, a
signal from the CC associated with the receiving socket will invoke the
`cb_msg_received` callback.

On the sending side there is more than one option to send a message::

    int netio_send_buffer(struct netio_send_socket* socket, struct netio_buffer* buf);
    int netio_send(struct netio_send_socket* socket, struct netio_buffer* buf, void* addr, size_t size, uint64_t key);
    int netio_send_imm(struct netio_send_socket* socket, struct netio_buffer* buf, void* addr, size_t size, uint64_t key, uint64_t imm);
    int netio_sendv(struct netio_send_socket* socket, struct netio_buffer** buf, struct iovec* iov, size_t count, uint64_t key);
    int netio_sendv_imm(struct netio_send_socket* socket, struct netio_buffer** buf, struct iovec* iov, size_t count, uint64_t key, uint64_t imm);

The simplest function is `netio_send_buffer`. As the name suggests, this
function sends a full buffer to the remote endpoint.

.. doxygenfunction:: netio_send_buffer
   :no-link:

When the buffer has been successfully transmitted to the remote endpoint, the
`cb_send_completed` callback will be called in response to a CO written by the
network stack in the CQ associated with the send socket. This callback has two
parameters, `socket` and `key`. The first parameter refers to the send socket
that issued the send operation. The `key` parameter is used by the user to
identify the individual send operation that completed. In the case of
`netio_send_buffer`, `key` is set to the address of the buffer that was sent.
All other send functions take a `key` parameter that is passed back into the
completion callback; the key can be set freely by the user.
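The sketch below illustrates the sender-side sequence and the round trip of
`key` through `cb_send_completed`. It is hypothetical: the field names of
`struct netio_buffer` (`data`, `size`) and the argument lists of
`netio_register_send_buffer` and `netio_connect` are assumptions to be checked
against the signatures on this page; the `netio_send` call follows the
declaration listed above::

    #include <inttypes.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    #define SLOT_SIZE 4096

    static struct netio_buffer send_buf;   /* memory registered for sending */

    /* Once the connection is up, send one message from the start of the
     * registered buffer; the chosen key (here the offset 0) identifies it. */
    static void on_established(struct netio_send_socket* socket)
    {
        const char msg[] = "hello";
        memcpy(send_buf.data, msg, sizeof(msg));            /* assumed field name */
        netio_send(socket, &send_buf, send_buf.data, sizeof(msg), /*key=*/0);
    }

    /* The completion carries back the key passed to netio_send; the
     * corresponding region of the send buffer may now be reused. */
    static void on_send_completed(struct netio_send_socket* socket, uint64_t key)
    {
        (void)socket;
        printf("send completed, key=%" PRIu64 "\n", key);
    }

    void start_sender(struct netio_send_socket* socket, const char* host,
                      unsigned short port)
    {
        send_buf.data = malloc(SLOT_SIZE);                  /* assumed field names */
        send_buf.size = SLOT_SIZE;
        netio_register_send_buffer(socket, &send_buf, 0);   /* parameter list is an assumption */

        socket->cb_connection_established = on_established;
        socket->cb_send_completed         = on_send_completed;
        netio_connect(socket, host, port);                  /* parameter list is an assumption */
    }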
.. note::
   The specific condition that triggers the creation of a CO on the sender side
   is FI_INJECT_COMPLETE. Other conditions exist and are listed in `libfabric's
   documentation`_.

The next send function is `netio_send`:

.. doxygenfunction:: netio_send
   :no-link:

This function does not transmit a complete buffer, but only a sub-region of the
buffer. The sub-region is identified by the parameters `addr` and `size`.

In addition to the data in the buffer, the user can pass a few bytes of extra
data to the remote endpoint. This is called *immediate data* and is transported
as part of the underlying protocol headers. If the user wants to make use of
immediate data, the function to use is:

.. doxygenfunction:: netio_send_imm
   :no-link:

On the receiving side, messages with immediate data are received by calling the
second receive callback, `cb_msg_imm_received`, which includes the additional
`imm` parameter. Messages without immediate data result in a call of the
`cb_msg_received` callback. If a message with immediate data is received, but
`cb_msg_imm_received` is not specified (NULL), `cb_msg_received` will be called
instead and the immediate data will be dropped.

Both `netio_send` and `netio_send_imm` come in versions that allow the use of a
scatter/gather vector instead of a single message pointer. As such, they
require a vector of buffers as well. Users must take care that every entry in
the IO vector is fully contained within the associated send buffer (i.e. all
entries reside in registered memory regions). The function declarations are
shown below.

.. doxygenfunction:: netio_sendv
   :no-link:

.. doxygenfunction:: netio_sendv_imm
   :no-link:

.. note::
   The scatter/gather vector has a maximum number of entries. This number
   defaults to 4 in libfabric, but it can be changed by setting the environment
   variables FI_VERBS_TX_IOV_LIMIT and FI_VERBS_RX_IOV_LIMIT (to be set to the
   same value on both the sending and the receiving side). The hard limit
   depends on the hardware; for NVIDIA ConnectX-5 it is 30. The corresponding
   hardware limits are called `max_srq_sge` and `max_sge` and can be probed
   with the command `ibv_devinfo -v`. NetIO-next contains a hardcoded limit
   NETIO_MAX_IOV_LEN set to 28.

.. warning::
   The buffers posted by the receiver side need to be large enough to
   accommodate the inbound messages. If this is not the case, on the occurrence
   of the first non-fitting message the sender CQ moves to an error state and
   does not allow further messages to be sent.
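To complement the sender sketches, a hedged receive-side sketch follows. The
callback signatures are the ones listed earlier in this section, while the
argument lists of `netio_init_listen_socket`, `netio_register_recv_buffer`,
`netio_post_recv` and `netio_listen`, and the `struct netio_buffer` field
names, are assumptions to be checked against the generated signatures; the
receive buffer is sized generously so that inbound messages always fit, as
required by the warning above::

    #include <inttypes.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define RECV_BUF_SIZE (64 * 1024)   /* larger than the biggest expected message */

    static struct netio_buffer recv_buf;

    /* When a sender connects, the listen socket creates a receive socket and
     * this callback fires; register and post a receive buffer on it. */
    static void on_connection(struct netio_recv_socket* recv_socket)
    {
        recv_buf.data = malloc(RECV_BUF_SIZE);                 /* assumed field names */
        recv_buf.size = RECV_BUF_SIZE;
        netio_register_recv_buffer(recv_socket, &recv_buf, 0); /* parameter list is an assumption */
        netio_post_recv(recv_socket, &recv_buf);               /* parameter list is an assumption */
    }

    /* Plain message: repost the buffer so further messages can be received. */
    static void on_msg(struct netio_recv_socket* r, struct netio_buffer* b,
                       void* data, size_t size)
    {
        (void)data;
        printf("received %zu bytes\n", size);
        netio_post_recv(r, b);
    }

    /* Message carrying immediate data: the extra 8 bytes arrive in `imm`. */
    static void on_msg_imm(struct netio_recv_socket* r, struct netio_buffer* b,
                           void* data, size_t size, uint64_t imm)
    {
        (void)data;
        printf("received %zu bytes, imm=%" PRIu64 "\n", size, imm);
        netio_post_recv(r, b);
    }

    void start_receiver(struct netio_listen_socket* listen_socket,
                        struct netio_eventloop* evloop /* assumed argument */,
                        const char* host, unsigned short port)
    {
        netio_init_listen_socket(listen_socket, evloop);   /* argument list is an assumption */
        listen_socket->cb_connection_established = on_connection;
        listen_socket->cb_msg_received            = on_msg;
        listen_socket->cb_msg_imm_received        = on_msg_imm;
        netio_listen(listen_socket, host, port);           /* argument list is an assumption */
    }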
Publish/Subscribe Communication
-------------------------------

Publish/Subscribe is a communication pattern in which a publisher sends
messages to a dynamic list of subscribers. Messages are categorized in streams
using numeric message tags called felix identifiers (fid), to which subscribers
can subscribe. A subscriber can subscribe to one or more fids of a publisher.
The publisher maintains an internal subscription table which contains
connections to the various subscribers. Connection management is automatic and
publishers do not need to connect to (or even be aware of) any subscribers.

The publish/subscribe communication pattern in unbuffered mode works as
follows. First a publish socket needs to be initialised on the sender side and
a subscribe socket on the receiving side.

.. doxygenfunction:: netio_unbuffered_publish_socket_init
   :no-link:

.. doxygenfunction:: netio_unbuffered_subscribe_socket_init
   :no-link:

On the publisher side a single send buffer is passed: the intended use case is
to use this large send buffer as a data pool and to send messages over the
network as scatter/gather vectors pointing to one or more locations within it.
On the subscriber side the number and size of the receive buffers need to be
passed: the allocation is done internally.

The publish socket exposes the following callbacks to the application, allowing
it to perform operations upon subscription, connection and successful
transmission events::

    void (*cb_subscribe)(struct netio_unbuffered_publish_socket*, netio_tag_t, void*, size_t);
    void (*cb_unsubscribe)(struct netio_unbuffered_publish_socket*, netio_tag_t, void*, size_t);
    void (*cb_connection_established)(struct netio_unbuffered_publish_socket*);
    void (*cb_connection_closed)(struct netio_unbuffered_publish_socket*);
    void (*cb_msg_published)(struct netio_unbuffered_publish_socket*, uint64_t);

The subscribe socket has analogous callbacks for connection events and a
callback for message reception::

    void (*cb_connection_established)(struct netio_unbuffered_subscribe_socket*);
    void (*cb_connection_closed)(struct netio_unbuffered_subscribe_socket*);
    void (*cb_error_connection_refused)(struct netio_unbuffered_subscribe_socket*);
    void (*cb_msg_received)(struct netio_unbuffered_subscribe_socket*, netio_tag_t, void*, size_t);

Subscriptions and unsubscriptions are performed by the subscriber with fid
granularity using

.. doxygenfunction:: netio_unbuffered_subscribe
   :no-link:

.. doxygenfunction:: netio_unbuffered_unsubscribe
   :no-link:

(Un)subscriptions happen by exchanging messages using the same network protocol
used to exchange data. (Un)subscription messages are sent by the send socket
included in the `netio_unbuffered_subscribe_socket` and each contains a `struct
netio_subscription_message`. (Un)subscription messages are received by a
`netio_recv_socket` spawned by the `netio_listen_socket` included in the
`netio_unbuffered_publish_socket` on connection request. Allocation and
registration of buffers for subscriptions is not exposed.

.. note::
   If a remote client unsubscribes from all fids it was subscribed to, the
   publisher closes the connection. Inside NetIO-next the procedure unfolds as
   follows: the send socket assigned to the remote client and belonging to the
   publish socket sends a shutdown message (FI_SHUTDOWN) that is received by
   the receiving socket of the subscribe socket. The subscribe socket echoes
   the shutdown via the send socket it uses for subscription messages, causing
   the closure of the receiving socket of the publish socket. This exchange of
   shutdown messages ensures that all resources associated with a closed
   connection are freed.
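A hedged subscriber-side sketch is shown below. The callback signature follows
the listing above, while the argument order of
`netio_unbuffered_subscribe_socket_init` (event loop, remote host and port,
number and size of the internally allocated buffers) and of
`netio_unbuffered_subscribe` is an assumption to be verified against the
generated signatures::

    #include <stdio.h>

    /* A received message belonging to a subscribed fid (tag). */
    static void on_subscribed_msg(struct netio_unbuffered_subscribe_socket* s,
                                  netio_tag_t tag, void* data, size_t size)
    {
        (void)s; (void)data;
        printf("fid %lu: %zu bytes\n", (unsigned long)tag, size);
    }

    void start_subscriber(struct netio_unbuffered_subscribe_socket* socket,
                          struct netio_eventloop* evloop /* assumed argument */,
                          const char* host, unsigned short port)
    {
        /* Number and size of the internally allocated receive buffers;
         * the argument order below is an assumption. */
        netio_unbuffered_subscribe_socket_init(socket, evloop, host, port,
                                               /*num_buffers=*/16,
                                               /*buffer_size=*/64 * 1024);
        socket->cb_msg_received = on_subscribed_msg;

        /* Subscribe to one fid; the publisher will start sending matching messages. */
        netio_unbuffered_subscribe(socket, /*fid=*/0x1234);
    }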
Data is published with `netio_unbuffered_publishv_usr`.

.. doxygenfunction:: netio_unbuffered_publishv_usr
   :no-link:

The `usr` field allows the user to insert up to 8 bytes of user data at the
beginning of the message. This is useful to add a small header as part of a
user-defined communication protocol. The user data do *not* need to be part of
a registered netio buffer. In the context of FELIX the `usr` field is used to
carry the `status byte` header that indicates error conditions.

The input/output parameter `key` is used to track the completion of publish
operations. A publish operation can trigger send operations on multiple
connections and can therefore produce multiple completion events in the RDMA
hardware. To keep count of the number of received completions, NetIO-next uses
the 8 bytes indicated by `key`. The `key` is passed to the CO used for the send
operation; in this way, when the send operation completes, a CO containing
`key` is notified. A stack of COs is allocated in
`netio_unbuffered_publish_socket_init` and is common to all `netio_send_socket`
instances belonging to the same `netio_unbuffered_publish_socket` and sending
data to different remote endpoints. COs are re-used: sending a message requires
popping one CO from the stack, while `cb_msg_published` returns a CO to the
stack. The size of the completion stack determines how many outstanding
messages there can be at a given time.

The return value of the unbuffered publish call is important. If
NETIO_STATUS_PARTIAL is returned, data was sent successfully only to a subset
of the subscribed nodes. This means users need to call the publish function
again in the future with the same parameters and the NETIO_REENTRY flag set.
NETIO_STATUS_AGAIN means that there were not enough resources to process the
operation. The user should issue the same publish call again, but in this case
without the NETIO_REENTRY flag. The callback `cb_msg_published` is invoked only
when data was sent to all subscribed remote endpoints. The object that
internally keeps track of successful deliveries and triggers `cb_msg_published`
is the `netio_semaphore`; it is used only in unbuffered mode.

.. note::
   In felix-tohost the `key` field is used to store the address of a message in
   the FELIX DMA buffer. Once the message has been delivered to all subscribers
   the semaphore triggers `cb_msg_published` with the CO carrying the `key`.
   The value of `key` is used to advance the FELIX DMA read pointer, allowing
   the firmware to write to that location again.

------------

.. [#] Virtual memory needs to be backed by `struct page*` in the Linux kernel.
   This is the case for any ordinary memory allocated using `malloc` and
   similar functions. Memory addresses obtained from device drivers that
   perform their own mapping into virtual address space may be problematic.
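Finally, a hedged sketch of a publish call with the retry behaviour described
above. `NETIO_STATUS_PARTIAL`, `NETIO_STATUS_AGAIN` and `NETIO_REENTRY` are the
values introduced in this section, while the exact argument order of
`netio_unbuffered_publishv_usr` is an assumption to be checked against the
generated signature::

    #include <stdint.h>
    #include <sys/uio.h>

    /* Publish one message for a given fid and drive the retry logic described
     * above. The argument order (socket, fid, I/O vector, entry count,
     * completion key, usr word, flags) is an assumption. */
    void publish_with_retry(struct netio_unbuffered_publish_socket* socket,
                            netio_tag_t fid, struct iovec* iov, size_t count,
                            uint64_t* key, uint64_t status_byte)
    {
        int ret = netio_unbuffered_publishv_usr(socket, fid, iov, count,
                                                key, status_byte, 0);
        while (ret == NETIO_STATUS_PARTIAL || ret == NETIO_STATUS_AGAIN) {
            /* PARTIAL: some subscribers are still pending, re-enter the call.
             * AGAIN: no resources were available, repeat without the flag.
             * A real application would typically return to the event loop and
             * retry later instead of spinning here. */
            int flags = (ret == NETIO_STATUS_PARTIAL) ? NETIO_REENTRY : 0;
            ret = netio_unbuffered_publishv_usr(socket, fid, iov, count,
                                                key, status_byte, flags);
        }
        /* cb_msg_published fires once every subscriber has received the message;
         * only then may the buffer regions behind `iov` be reused. */
    }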