Name | | Date | Size | #Lines | LOC | |
---|---|---|---|---|---|---|
Kconfig | D | 22-Nov-2024 | 809 | 28 | 22 | |
Makefile | D | 22-Nov-2024 | 496 | 22 | 14 | |
README | D | 22-Nov-2024 | 10.2 KiB | 214 | 176 | |
rtrs-clt-stats.c | D | 22-Nov-2024 | 4.6 KiB | 199 | 144 | |
rtrs-clt-sysfs.c | D | 22-Nov-2024 | 13 KiB | 515 | 413 | |
rtrs-clt-trace.c | D | 22-Nov-2024 | 337 | 16 | 4 | |
rtrs-clt-trace.h | D | 22-Nov-2024 | 2.4 KiB | 87 | 64 | |
rtrs-clt.c | D | 22-Nov-2024 | 83.9 KiB | 3,208 | 2,255 | |
rtrs-clt.h | D | 22-Nov-2024 | 6.7 KiB | 253 | 194 | |
rtrs-log.h | D | 22-Nov-2024 | 954 | 29 | 17 | |
rtrs-pri.h | D | 22-Nov-2024 | 10.8 KiB | 409 | 256 | |
rtrs-srv-stats.c | D | 22-Nov-2024 | 1.3 KiB | 52 | 33 | |
rtrs-srv-sysfs.c | D | 22-Nov-2024 | 7.9 KiB | 320 | 248 | |
rtrs-srv-trace.c | D | 22-Nov-2024 | 359 | 17 | 5 | |
rtrs-srv-trace.h | D | 22-Nov-2024 | 2.2 KiB | 89 | 68 | |
rtrs-srv.c | D | 22-Nov-2024 | 57.8 KiB | 2,347 | 1,840 | |
rtrs-srv.h | D | 22-Nov-2024 | 3.9 KiB | 157 | 116 | |
rtrs.c | D | 22-Nov-2024 | 15.1 KiB | 643 | 484 | |
rtrs.h | D | 22-Nov-2024 | 5.3 KiB | 189 | 80 |
README
****************************
RDMA Transport (RTRS)
****************************

RTRS (RDMA Transport) is a reliable high speed transport library
which provides support to establish an optimal number of connections
between client and server machines using RDMA (InfiniBand, RoCE, iWARP)
transport. It is optimized to transfer (read/write) IO blocks.

In its core interface it follows the BIO semantics of providing the
possibility to either write data from an sg list to the remote side
or to request ("read") data transfer from the remote side into a given
sg list.

RTRS provides I/O fail-over and load-balancing capabilities by using
multipath I/O (see "add_path" and "mp_policy" configuration entries in
Documentation/ABI/testing/sysfs-class-rtrs-client).

RTRS is used by the RNBD (RDMA Network Block Device) modules.

==================
Transport protocol
==================

Overview
--------
An established connection between a client and a server is called an RTRS
session. A session is associated with a set of memory chunks reserved on the
server side for a given client for rdma transfer. A session
consists of multiple paths, each representing a separate physical link
between client and server. Those are used for load balancing and failover.
Each path consists of as many connections (QPs) as there are cpus on
the client.

When processing an incoming write or read request, the rtrs client uses memory
chunks reserved for it on the server side. Their number, size and addresses
need to be exchanged between client and server during the connection
establishment phase. Apart from the memory related information, the client
needs to inform the server about the session name and identify each path and
connection individually.

On an established session the client sends write or read messages to the server.
The server uses the immediate field to tell the client which request is being
acknowledged and to carry the errno. The client uses the immediate field to
tell the server which of the memory chunks has been accessed and at which
offset the message can be found.

Module parameter always_invalidate is introduced for the security problem
discussed in LPC RDMA MC 2019. When always_invalidate=Y, on the server side we
invalidate each rdma buffer before we hand it over to RNBD server and
then pass it to the block layer. A new rkey is generated and registered for the
buffer after it returns from the block layer and RNBD server.
The new rkey is sent back to the client along with the IO result.
The procedure is the default behaviour of the driver. This invalidation and
registration on each IO causes a performance drop of up to 20%. A user of the
driver may choose to load the modules with this mechanism switched off
(always_invalidate=N), if they understand and can take the risk of a malicious
client being able to corrupt memory of a server it is connected to. This might
be a reasonable option in a scenario where all the clients and all the servers
are located within a secure datacenter.


Connection establishment
------------------------

1. Client starts establishing connections belonging to a path of a session one
by one via attaching RTRS_MSG_CON_REQ messages to the rdma_connect requests.
Those include uuid of the session and uuid of the path to be
established. They are used by the server to find a persisting session/path or
to create a new one when necessary. The message also contains the protocol
version and magic for compatibility, total number of connections per session
(as many as cpus on the client), the id of the current connection and
the reconnect counter, which is used to resolve the situations where
client is trying to reconnect a path, while server is still destroying the old
one.

2. Server accepts the connection requests one by one and attaches
RTRS_MSG_CONN_RSP messages to the rdma_accept. Apart from magic and
protocol version, the messages include error code, queue depth supported by
the server (number of memory chunks which are going to be allocated for that
session) and the maximum size of one io. The RTRS_MSG_NEW_RKEY_F flag is set
when always_invalidate=Y.

3. After all connections of a path are established client sends to server the
RTRS_MSG_INFO_REQ message, containing the name of the session. This message
requests the address information from the server.

4. Server replies to the session info request message with RTRS_MSG_INFO_RSP,
which contains the addresses and keys of the RDMA buffers allocated for that
session.

5. Session becomes connected after all paths to be established are connected
(i.e. steps 1-4 finished for all paths requested for a session).

6. Server and client periodically exchange heartbeat messages (empty rdma
messages with an immediate field) which are used to detect a crash on the
remote side or a network outage in the absence of IO.

7. On any RDMA related error or in the case of a heartbeat timeout, the
corresponding path is disconnected, all the inflight IO are failed over to a
healthy path, if any, and the reconnect mechanism is triggered.

CLT                                     SRV
*for each connection belonging to a path and for each path:
RTRS_MSG_CON_REQ  ------------------->
                   <------------------- RTRS_MSG_CON_RSP
...
*after all connections are established:
RTRS_MSG_INFO_REQ ------------------->
                   <------------------- RTRS_MSG_INFO_RSP
*heartbeat is started from both sides:
                   -------------------> [RTRS_HB_MSG_IMM]
[RTRS_HB_MSG_ACK]  <-------------------
[RTRS_HB_MSG_IMM]  <-------------------
                   -------------------> [RTRS_HB_MSG_ACK]

IO path
-------

* Write (always_invalidate=N) *

1. When processing a write request client selects one of the memory chunks
on the server side and rdma writes there the user data, user header and the
RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
contains size of the user header. The client tells the server which chunk has
been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
using the IMM field.

2. When confirming a write request server sends an "empty" rdma message with
an immediate field. The 32 bit field is used to specify the outstanding
inflight IO and for the error code.

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)

* Write (always_invalidate=Y) *

1. When processing a write request client selects one of the memory chunks
on the server side and rdma writes there the user data, user header and the
RTRS_MSG_RDMA_WRITE message. Apart from the type (write), the message only
contains size of the user header. The client tells the server which chunk has
been accessed and at what offset the RTRS_MSG_RDMA_WRITE can be found by
using the IMM field. The server first invalidates the rkey associated with the
memory chunk and, once that finishes, passes the IO to the RNBD server module.

2. When confirming a write request server sends an "empty" rdma message with
an immediate field. The 32 bit field is used to specify the outstanding
inflight IO and for the error code. The new rkey is sent back using
SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
the message, updates the rkey for the rbuffer, finishes the IO, and then posts
the recv buffer back for later use.
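
As an illustration of how a single 32 bit immediate value can carry both the
id of the inflight IO and the error code, here is a minimal sketch in C. The
bit split (4 bits of message type, 9 bits of errno, 19 bits of id) and every
sketch_* name are assumptions made for this example only; the README does not
specify the layout, and the driver sources remain the authority.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical layout, for illustration only: [type:4][errno:9][id:19]. */
#define SKETCH_ID_BITS      19u
#define SKETCH_ERRNO_BITS   9u
#define SKETCH_PAYLOAD_BITS (SKETCH_ID_BITS + SKETCH_ERRNO_BITS)
#define SKETCH_ID_MASK      ((1u << SKETCH_ID_BITS) - 1u)
#define SKETCH_ERRNO_MASK   ((1u << SKETCH_ERRNO_BITS) - 1u)

/* Hypothetical type codes; they only mirror the roles named in the text. */
enum sketch_imm_type { SKETCH_IO_REQ = 0, SKETCH_IO_RSP = 1 };

struct sketch_io_rsp {
	uint32_t id;  /* which inflight IO is being acknowledged */
	int err;      /* 0 on success, negative errno otherwise */
};

/* Server side: pack "id + errno" into one immediate value. */
static uint32_t sketch_pack_io_rsp(uint32_t id, int err)
{
	uint32_t e = (uint32_t)abs(err);

	assert(id <= SKETCH_ID_MASK);
	assert(e <= SKETCH_ERRNO_MASK);
	return ((uint32_t)SKETCH_IO_RSP << SKETCH_PAYLOAD_BITS) |
	       (e << SKETCH_ID_BITS) | id;
}

/* Client side: returns 0 and fills *out, or -1 if imm is not an IO response. */
static int sketch_unpack_io_rsp(uint32_t imm, struct sketch_io_rsp *out)
{
	if ((imm >> SKETCH_PAYLOAD_BITS) != (uint32_t)SKETCH_IO_RSP)
		return -1;
	out->id = imm & SKETCH_ID_MASK;
	out->err = -(int)((imm >> SKETCH_ID_BITS) & SKETCH_ERRNO_MASK);
	return 0;
}
```

Under these assumed widths, sketch_pack_io_rsp(42, -5) yields 0x1028002A, and
unpacking that value gives back id 42 and error -5. Packing both fields into
the immediate is what lets the acknowledgement travel as an "empty" rdma
message with no payload buffer.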

CLT                                                          SRV
usr_data + usr_hdr + rtrs_msg_rdma_write -----------------> [RTRS_IO_REQ_IMM]
[RTRS_MSG_RKEY_RSP]                      <----------------- (RTRS_MSG_RKEY_RSP)
[RTRS_IO_RSP_IMM]                        <----------------- (id + errno)


* Read (always_invalidate=N) *

1. When processing a read request client selects one of the memory chunks
on the server side and rdma writes there the user header and the
RTRS_MSG_RDMA_READ message. This message contains the type (read), size of
the user header, flags (specifying if memory invalidation is necessary) and the
list of addresses along with keys for the data to be read into.

2. When confirming a read request server transfers the requested data first,
attaches an invalidation message if requested and finally an "empty" rdma
message with an immediate field. The 32 bit field is used to specify the
outstanding inflight IO and the error code.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

* Read (always_invalidate=Y) *

1. When processing a read request client selects one of the memory chunks
on the server side and rdma writes there the user header and the
RTRS_MSG_RDMA_READ message. This message contains the type (read), size of
the user header, flags (specifying if memory invalidation is necessary) and the
list of addresses along with keys for the data to be read into.
The server first invalidates the rkey associated with the memory chunk and,
once that finishes, passes the IO to the RNBD server module.

2. When confirming a read request server transfers the requested data first,
attaches an invalidation message if requested and finally an "empty" rdma
message with an immediate field.
The 32 bit field is used to specify the
outstanding inflight IO and the error code. The new rkey is sent back using
SEND_WITH_IMM WR. When the client receives the new rkey message, it validates
the message, updates the rkey for the rbuffer, finishes the IO, and then posts
the recv buffer back for later use.

CLT                                           SRV
usr_hdr + rtrs_msg_rdma_read --------------> [RTRS_IO_REQ_IMM]
[RTRS_IO_RSP_IMM]            <-------------- usr_data + (id + errno)
[RTRS_MSG_RKEY_RSP]          <-------------- (RTRS_MSG_RKEY_RSP)
or in case client requested invalidation:
[RTRS_IO_RSP_IMM_W_INV]      <-------------- usr_data + (INV) + (id + errno)

=========================================
Contributors List (in alphabetical order)
=========================================
Danil Kipnis <danil.kipnis@profitbricks.com>
Fabian Holler <mail@fholler.de>
Guoqing Jiang <guoqing.jiang@cloud.ionos.com>
Jack Wang <jinpu.wang@profitbricks.com>
Kleber Souza <kleber.souza@profitbricks.com>
Lutz Pogrell <lutz.pogrell@cloud.ionos.com>
Milind Dumbare <Milind.dumbare@gmail.com>
Roman Penyaev <roman.penyaev@profitbricks.com>