.. SPDX-License-Identifier: GPL-2.0

VMBus
=====
VMBus is a software construct provided by Hyper-V to guest VMs.  It
consists of a control path and common facilities used by synthetic
devices that Hyper-V presents to guest VMs.  The control path is
used to offer synthetic devices to the guest VM and, in some cases,
to rescind those devices.  The common facilities include software
channels for communicating between the device driver in the guest VM
and the synthetic device implementation that is part of Hyper-V, and
signaling primitives to allow Hyper-V and the guest to interrupt
each other.

VMBus is modeled in Linux as a bus, with the expected /sys/bus/vmbus
entry in a running Linux guest.  The VMBus driver (drivers/hv/vmbus_drv.c)
establishes the VMBus control path with the Hyper-V host, then
registers itself as a Linux bus driver.  It implements the standard
bus functions for adding and removing devices to/from the bus.

Most synthetic devices offered by Hyper-V have a corresponding Linux
device driver.  These devices include:

* SCSI controller
* NIC
* Graphics frame buffer
* Keyboard
* Mouse
* PCI device pass-thru
* Heartbeat
* Time Sync
* Shutdown
* Memory balloon
* Key/Value Pair (KVP) exchange with Hyper-V
* Hyper-V online backup (a.k.a. VSS)

Guest VMs may have multiple instances of the synthetic SCSI
controller, synthetic NIC, and PCI pass-thru devices.  Other
synthetic devices are limited to a single instance per VM.  Not
listed above are a small number of synthetic devices offered by
Hyper-V that are used only by Windows guests and for which Linux
does not have a driver.

Hyper-V uses the terms "VSP" and "VSC" in describing synthetic
devices.  "VSP" refers to the Hyper-V code that implements a
particular synthetic device, while "VSC" refers to the driver for
the device in the guest VM.  For example, the Linux driver for the
synthetic NIC is referred to as "netvsc" and the Linux driver for
the synthetic SCSI controller is "storvsc".  These drivers contain
functions with names like "storvsc_connect_to_vsp".

VMBus channels
--------------
An instance of a synthetic device uses VMBus channels to communicate
between the VSP and the VSC.  Channels are bi-directional and used
for passing messages.  Most synthetic devices use a single channel,
but the synthetic SCSI controller and synthetic NIC may use multiple
channels to achieve higher performance and greater parallelism.

Each channel consists of two ring buffers.  These are classic ring
buffers from a university data structures textbook.  If the read
and write pointers are equal, the ring buffer is considered to be
empty, so a full ring buffer always has at least one byte unused.
The "in" ring buffer is for messages from the Hyper-V host to the
guest, and the "out" ring buffer is for messages from the guest to
the Hyper-V host.  In Linux, the "in" and "out" designations are as
viewed by the guest side.  The ring buffers are memory that is
shared between the guest and the host, and they follow the standard
paradigm where the memory is allocated by the guest, with the list
of guest physical addresses (GPAs) that make up the ring buffer
communicated to the host.  Each ring buffer consists of a header
page (4 Kbytes) with the read and write indices and some control
flags, followed by the memory for the actual ring.  The size of the
ring is determined by the VSC in the guest and is specific to each
synthetic device.  The list of GPAs making up the ring is
communicated to the Hyper-V host over the VMBus control path as a
GPA Descriptor List (GPADL).  See function vmbus_establish_gpadl().
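
The empty/full convention described above can be sketched in a few
lines of C.  This is a simplified model and not the kernel's
implementation; the structure and function names (ring_model,
ring_bytes_to_read, ring_bytes_to_write) are hypothetical.

.. code-block:: c

  #include <stdint.h>

  /* Hypothetical model of the indices kept in a ring buffer header. */
  struct ring_model {
      uint32_t read_index;    /* next byte the reader will consume */
      uint32_t write_index;   /* next byte the writer will fill */
      uint32_t size;          /* bytes in the ring, excluding the header */
  };

  /* Bytes queued for the reader; zero when the indices are equal. */
  static uint32_t ring_bytes_to_read(const struct ring_model *r)
  {
      if (r->write_index >= r->read_index)
          return r->write_index - r->read_index;
      return r->size - r->read_index + r->write_index;
  }

  /*
   * Bytes the writer may add.  One byte is always held back so that a
   * full ring never has read_index == write_index, which means "empty".
   */
  static uint32_t ring_bytes_to_write(const struct ring_model *r)
  {
      return r->size - ring_bytes_to_read(r) - 1;
  }

In this model a freshly initialized 4096-byte ring reports 0 bytes to
read and 4095 bytes available to write.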

Each ring buffer is mapped into contiguous Linux kernel virtual
space in three parts:  1) the 4 Kbyte header page, 2) the memory
that makes up the ring itself, and 3) a second mapping of the memory
that makes up the ring itself.  Because (2) and (3) are contiguous
in kernel virtual space, the code that copies data to and from the
ring buffer need not be concerned with ring buffer wrap-around.
Once a copy operation has completed, the read or write index may
need to be reset to point back into the first mapping, but the
actual data copy does not need to be broken into two parts.  This
approach also allows complex data structures to be easily accessed
directly in the ring without handling wrap-around.

On arm64 with page sizes > 4 Kbytes, the header page must still be
passed to Hyper-V as a 4 Kbyte area.  But the memory for the actual
ring must be aligned to PAGE_SIZE and have a size that is a multiple
of PAGE_SIZE so that the duplicate mapping trick can be done.  Hence
a portion of the header page is unused and not communicated to
Hyper-V.  This case is handled by vmbus_establish_gpadl().

Hyper-V enforces a limit on the aggregate amount of guest memory
that can be shared with the host via GPADLs.  This limit ensures
that a rogue guest can't force the consumption of excessive host
resources.  For Windows Server 2019 and later, this limit is
approximately 1280 Mbytes.  For versions prior to Windows Server
2019, the limit is approximately 384 Mbytes.

VMBus channel messages
----------------------
All messages sent in a VMBus channel have a standard header that includes
the message length, the offset of the message payload, some flags, and a
transactionID.  The portion of the message after the header is
unique to each VSP/VSC pair.
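
As a rough illustration, the header can be pictured as a small fixed
structure followed by the device-specific payload.  The sketch below
is loosely modeled on struct vmpacket_descriptor in
include/linux/hyperv.h, which remains the authoritative definition.
The names msg_header_model, msg_payload and msg_payload_len are
hypothetical, and the "type" field and the 8-byte units for offset
and length are assumptions of this sketch.

.. code-block:: c

  #include <stddef.h>
  #include <stdint.h>

  /* Illustrative header layout; not the kernel's definition. */
  struct msg_header_model {
      uint16_t type;      /* packet type (assumed field) */
      uint16_t offset8;   /* payload offset, in units of 8 bytes */
      uint16_t len8;      /* total message length, in units of 8 bytes */
      uint16_t flags;
      uint64_t trans_id;  /* transactionID */
  };

  /* Pointer to the VSP/VSC-specific payload that follows the header. */
  static const void *msg_payload(const struct msg_header_model *h)
  {
      return (const uint8_t *)h + ((size_t)h->offset8 << 3);
  }

  /* Payload length in bytes, or 0 if the header fields are inconsistent. */
  static size_t msg_payload_len(const struct msg_header_model *h)
  {
      if (h->len8 < h->offset8)
          return 0;
      return ((size_t)h->len8 - h->offset8) << 3;
  }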

Messages follow one of two patterns:

* Unidirectional:  Either side sends a message and does not
  expect a response message
* Request/response:  One side (usually the guest) sends a message
  and expects a response

The transactionID (a.k.a. "requestID") is for matching requests and
responses.  Some synthetic devices allow multiple requests to be in
flight simultaneously, so the guest specifies a transactionID when
sending a request.  Hyper-V sends back the same transactionID in the
matching response.
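
One way to picture the matching of responses to requests is a small
table of outstanding requests indexed by transactionID.  The sketch
below is only an illustration of that idea; the table, its size, and
the function names are hypothetical and do not reflect how any
particular VSC tracks its requests.

.. code-block:: c

  #include <stddef.h>
  #include <stdint.h>

  #define MAX_PENDING 8   /* arbitrary limit for this sketch */

  struct pending_table {
      void *ctx[MAX_PENDING];     /* per-request context, NULL if free */
  };

  /* Record a request; returns the transactionID to send, or 0 if full. */
  static uint64_t request_add(struct pending_table *t, void *ctx)
  {
      for (size_t i = 0; i < MAX_PENDING; i++) {
          if (!t->ctx[i]) {
              t->ctx[i] = ctx;
              return i + 1;       /* IDs start at 1 so 0 can mean failure */
          }
      }
      return 0;
  }

  /* Look up and release the request matching a response's transactionID. */
  static void *request_complete(struct pending_table *t, uint64_t trans_id)
  {
      void *ctx;

      if (trans_id == 0 || trans_id > MAX_PENDING)
          return NULL;            /* ID was never issued */
      ctx = t->ctx[trans_id - 1];
      t->ctx[trans_id - 1] = NULL;
      return ctx;
  }

Because each slot is cleared on completion, a duplicate response
carrying an already completed transactionID yields NULL instead of a
stale context.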

Messages passed between the VSP and VSC are control messages.  For
example, a message sent from the storvsc driver might be "execute
this SCSI command".  If a message also implies some data transfer
between the guest and the Hyper-V host, the actual data to be
transferred may be embedded with the control message, or it may be
specified as a separate data buffer that the Hyper-V host will
access as a DMA operation.  The former case is used when the size of
the data is small and the cost of copying the data to and from the
ring buffer is minimal.  For example, time sync messages from the
Hyper-V host to the guest contain the actual time value.  When the
data is larger, a separate data buffer is used.  In this case, the
control message contains a list of GPAs that describe the data
buffer.  For example, the storvsc driver uses this approach to
specify the data buffers to/from which disk I/O is done.

Three functions exist to send VMBus channel messages:

1. vmbus_sendpacket():  Control-only messages and messages with
   embedded data -- no GPAs
2. vmbus_sendpacket_pagebuffer(): Message with list of GPAs
   identifying data to transfer.  An offset and a length are
   associated with each GPA so that multiple discontinuous areas
   of guest memory can be targeted.
3. vmbus_sendpacket_mpb_desc(): Message with list of GPAs
   identifying data to transfer.  A single offset and length are
   associated with the whole list of GPAs.  The GPAs must describe
   a single logical area of guest memory to be targeted.
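
The difference between the two GPA-based variants lies in where the
offset and length live.  The sketch below models both shapes.  The
structure and function names are hypothetical, and the 4 Kbyte page
size is an assumption of the sketch; see include/linux/hyperv.h for
the structures the kernel actually passes to these functions.

.. code-block:: c

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdint.h>

  #define HV_PAGE_BYTES 4096u     /* assumption: 4 Kbyte Hyper-V pages */

  /* One offset/length per page: areas may be discontinuous. */
  struct page_range_model {
      uint64_t pfn;       /* guest page frame number */
      uint32_t offset;    /* byte offset within that page */
      uint32_t len;       /* bytes to transfer from that page */
  };

  /* One offset/length for the whole list: a single logical area. */
  struct multi_page_model {
      uint32_t offset;        /* byte offset into the first page */
      uint32_t len;           /* total bytes across all pages */
      const uint64_t *pfns;   /* pages that make up the area */
      size_t npfns;
  };

  /* Total bytes described by a list of independent ranges. */
  static uint64_t page_ranges_total(const struct page_range_model *r,
                                    size_t n)
  {
      uint64_t total = 0;

      for (size_t i = 0; i < n; i++)
          total += r[i].len;
      return total;
  }

  /* Check that the page count matches the single offset and length. */
  static bool multi_page_is_consistent(const struct multi_page_model *m)
  {
      uint64_t span = (uint64_t)m->offset + m->len;

      return m->offset < HV_PAGE_BYTES &&
             m->npfns == (span + HV_PAGE_BYTES - 1) / HV_PAGE_BYTES;
  }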

Historically, Linux guests have trusted Hyper-V to send well-formed
and valid messages, and Linux drivers for synthetic devices did not
fully validate messages.  With the introduction of processor
technologies that fully encrypt guest memory and that allow the
guest to not trust the hypervisor (AMD SEV-SNP, Intel TDX), trusting
the Hyper-V host is no longer a valid assumption.  The drivers for
VMBus synthetic devices are being updated to fully validate any
values read from memory that is shared with Hyper-V, which includes
messages from VMBus devices.  To facilitate such validation,
messages read by the guest from the "in" ring buffer are copied to a
temporary buffer that is not shared with Hyper-V.  Validation is
performed in this temporary buffer without the risk of Hyper-V
maliciously modifying the message after it is validated but before
it is used.
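
The copy-then-validate pattern can be sketched as follows.  This is
a minimal illustration of the idea and not code from any VMBus
driver; the message layout and the names msg_model and
read_validated are hypothetical.

.. code-block:: c

  #include <stdint.h>
  #include <string.h>

  #define MAX_PAYLOAD 64  /* arbitrary bound for this sketch */

  struct msg_model {
      uint32_t len;                   /* payload bytes in use */
      uint8_t payload[MAX_PAYLOAD];
  };

  /*
   * Copy a message out of host-visible memory, then validate only the
   * private copy.  Returns the payload length, or -1 if it is invalid.
   */
  static int read_validated(const void *shared, struct msg_model *priv)
  {
      memcpy(priv, shared, sizeof(*priv));    /* snapshot first */
      if (priv->len > MAX_PAYLOAD)
          return -1;      /* reject before any use of the payload */
      return (int)priv->len;
  }

Because every later access goes through the private copy, a change
the host makes to the shared memory after the check cannot alter the
length that was validated.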

Synthetic Interrupt Controller (synic)
--------------------------------------
Hyper-V provides each guest CPU with a synthetic interrupt controller
that is used by VMBus for host-guest communication. While each synic
defines 16 synthetic interrupts (SINT), Linux uses only one of the 16
(VMBUS_MESSAGE_SINT). All interrupts related to communication between
the Hyper-V host and a guest CPU use that SINT.

The SINT is mapped to a single per-CPU architectural interrupt (i.e.,
an 8-bit x86/x64 interrupt vector, or an arm64 PPI INTID). Because
each CPU in the guest has a synic and may receive VMBus interrupts,
they are best modeled in Linux as per-CPU interrupts. This model works
well on arm64 where a single per-CPU Linux IRQ is allocated for
VMBUS_MESSAGE_SINT. This IRQ appears in /proc/interrupts as an IRQ labelled
"Hyper-V VMbus". Since x86/x64 lacks support for per-CPU IRQs, an x86
interrupt vector is statically allocated (HYPERVISOR_CALLBACK_VECTOR)
across all CPUs and explicitly coded to call vmbus_isr(). In this case,
there's no Linux IRQ, and the interrupts are visible in aggregate in
/proc/interrupts on the "HYP" line.

The synic provides the means to demultiplex the architectural interrupt into
one or more logical interrupts and route the logical interrupt to the proper
VMBus handler in Linux. This demultiplexing is done by vmbus_isr() and
related functions that access synic data structures.

The synic is not modeled in Linux as an irq chip or irq domain,
and the demultiplexed logical interrupts are not Linux IRQs. As such,
they don't appear in /proc/interrupts or /proc/irq. The CPU
affinity for one of these logical interrupts is controlled via an
entry under /sys/bus/vmbus as described below.

VMBus interrupts
----------------
VMBus provides a mechanism for the guest to interrupt the host when
the guest has queued new messages in a ring buffer.  The host
expects that the guest will send an interrupt only when an "out"
ring buffer transitions from empty to non-empty.  If the guest sends
interrupts at other times, the host deems such interrupts to be
unnecessary.  If a guest sends an excessive number of unnecessary
interrupts, the host may throttle that guest by suspending its
execution for a few seconds to prevent a denial-of-service attack.
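
The empty to non-empty rule reduces to a single comparison made
after a message has been written.  The following sketch shows only
that comparison; the function name is hypothetical, and a real
implementation also has to deal with memory ordering and with any
interrupt masking flags in the ring buffer header.

.. code-block:: c

  #include <stdbool.h>
  #include <stdint.h>

  /*
   * Decide whether the guest should interrupt the host after a write.
   * old_write is the write index before the new message was queued,
   * and read is the host's read index sampled after the write.
   */
  static bool should_signal_host(uint32_t old_write, uint32_t read)
  {
      /* The ring was empty before the write only if the indices matched. */
      return old_write == read;
  }

If the host had not yet caught up (the indices differ), it is still
draining earlier messages and will find the new one without a
further interrupt.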

Similarly, the host will interrupt the guest via the synic when
it sends a new message on the VMBus control path, or when a VMBus
channel "in" ring buffer transitions from empty to non-empty due to
the host inserting a new VMBus channel message. The control message stream
and each VMBus channel "in" ring buffer are separate logical interrupts
that are demultiplexed by vmbus_isr(). It demultiplexes by first checking
for channel interrupts by calling vmbus_chan_sched(), which looks at a synic
bitmap to determine which channels have pending interrupts on this CPU.
If multiple channels have pending interrupts for this CPU, they are
processed sequentially.  When all channel interrupts have been processed,
vmbus_isr() checks for and processes any messages received on the VMBus
control path.
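
The bitmap scan at the heart of this demultiplexing can be modeled
as below.  This is a simplified sketch and not the code in
vmbus_chan_sched(); the function name and the plain (non-atomic) bit
clearing are assumptions made to keep the example self-contained.

.. code-block:: c

  #include <limits.h>
  #include <stddef.h>

  #define WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

  /*
   * Collect the IDs of all pending channels from a bitmap, clearing each
   * bit as it is consumed.  Returns the number of IDs written to out.
   */
  static size_t collect_pending(unsigned long *bitmap, size_t nwords,
                                unsigned int *out, size_t max)
  {
      size_t n = 0;

      for (size_t w = 0; w < nwords; w++) {
          for (size_t b = 0; b < WORD_BITS && n < max; b++) {
              unsigned long mask = 1UL << b;

              if (bitmap[w] & mask) {
                  bitmap[w] &= ~mask;  /* not atomic; a real ISR must be */
                  out[n++] = (unsigned int)(w * WORD_BITS + b);
              }
          }
      }
      return n;
  }

Each returned ID would then be handed to the handler for that
channel, one after another, which corresponds to the sequential
processing described above.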

The guest CPU that a VMBus channel will interrupt is selected by the
guest when the channel is created, and the host is informed of that
selection.  VMBus devices are broadly grouped into two categories:

1. "Slow" devices that need only one VMBus channel.  These devices
   (such as keyboard, mouse, heartbeat, and timesync) generate
   relatively few interrupts.  Their VMBus channels are all
   assigned to interrupt the VMBUS_CONNECT_CPU, which is always
   CPU 0.

2. "High speed" devices that may use multiple VMBus channels for
   higher parallelism and performance.  These devices include the
   synthetic SCSI controller and synthetic NIC.  Their VMBus
   channel interrupts are assigned to CPUs that are spread out
   among the available CPUs in the VM so that interrupts on
   multiple channels can be processed in parallel.

The assignment of VMBus channel interrupts to CPUs is done in the
function init_vp_index().  This assignment is done outside of the
normal Linux interrupt affinity mechanism, so the interrupts are
neither "unmanaged" nor "managed" interrupts.

The CPU that a VMBus channel will interrupt can be seen in
/sys/bus/vmbus/devices/<deviceGUID>/channels/<channelRelID>/cpu.
When running on later versions of Hyper-V, the CPU can be changed
by writing a new value to this sysfs entry. Because VMBus channel
interrupts are not Linux IRQs, there are no entries in /proc/interrupts
or /proc/irq corresponding to individual VMBus channel interrupts.

An online CPU in a Linux guest cannot be taken offline if it has
VMBus channel interrupts assigned to it.  Any such channel
interrupts must first be manually reassigned to another CPU as
described above.  When no channel interrupts are assigned to the
CPU, it can be taken offline.

The VMBus channel interrupt handling code is designed to work
correctly even if an interrupt is received on a CPU other than the
CPU assigned to the channel.  Specifically, the code does not use
CPU-based exclusion for correctness.  In normal operation, Hyper-V
will interrupt the assigned CPU.  But when the CPU assigned to a
channel is being changed via sysfs, the guest doesn't know exactly
when Hyper-V will make the transition.  The code must work correctly
even if there is a time lag before Hyper-V starts interrupting the
new CPU.  See comments in target_cpu_store().

VMBus device creation/deletion
------------------------------
Hyper-V and the Linux guest have a separate message-passing path
that is used for synthetic device creation and deletion. This
path does not use a VMBus channel.  See vmbus_post_msg() and
vmbus_on_msg_dpc().

The first step is for the guest to connect to the generic
Hyper-V VMBus mechanism.  As part of establishing this connection,
the guest and Hyper-V agree on a VMBus protocol version they will
use.  This negotiation allows newer Linux kernels to run on older
Hyper-V versions, and vice versa.
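
The outcome of such a negotiation can be modeled as picking the
newest version that both sides support.  The sketch below captures
only that selection logic.  The real exchange is carried out with
messages on the VMBus control path, and the function name, the
newest-first ordering, and the version numbers used with it are
assumptions of this sketch.

.. code-block:: c

  #include <stddef.h>
  #include <stdint.h>

  /*
   * Pick the first version in the guest's list, ordered newest first,
   * that also appears in the host's list.  Returns 0 if none match.
   */
  static uint32_t pick_version(const uint32_t *guest, size_t nguest,
                               const uint32_t *host, size_t nhost)
  {
      for (size_t g = 0; g < nguest; g++)
          for (size_t h = 0; h < nhost; h++)
              if (guest[g] == host[h])
                  return guest[g];
      return 0;
  }

With made-up guest versions {30, 20, 10} and an older host that
supports only {10, 20}, the sketch settles on 20.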

The guest then tells Hyper-V to "send offers".  Hyper-V sends an
offer message to the guest for each synthetic device that the VM
is configured to have. Each VMBus device type has a fixed GUID
known as the "class ID", and each VMBus device instance is also
identified by a GUID. The offer message from Hyper-V contains
both GUIDs to uniquely (within the VM) identify the device.
There is one offer message for each device instance, so a VM with
two synthetic NICs will get two offer messages with the NIC
class ID. The ordering of offer messages can vary from boot-to-boot
and must not be assumed to be consistent in Linux code. Offer
messages may also arrive long after Linux has initially booted
because Hyper-V supports adding devices, such as synthetic NICs,
to running VMs. A new offer message is processed by
vmbus_process_offer(), which indirectly invokes vmbus_add_channel_work().

Upon receipt of an offer message, the guest identifies the device
type based on the class ID, and invokes the correct driver to set up
the device.  Driver/device matching is performed using the standard
Linux mechanism.
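
Conceptually, matching an offer to a driver is a lookup of the class
ID in a table of IDs that drivers have registered.  The sketch below
illustrates that lookup only.  It does not show the Linux driver
core, and the structure names, driver names, and GUID values used
with it are hypothetical.

.. code-block:: c

  #include <stddef.h>
  #include <stdint.h>
  #include <string.h>

  struct guid_model {
      uint8_t b[16];
  };

  struct driver_model {
      const char *name;
      struct guid_model class_id;     /* device type this driver handles */
  };

  /* Return the driver whose class ID matches the offer, or NULL. */
  static const struct driver_model *
  match_driver(const struct driver_model *drivers, size_t n,
               const struct guid_model *offer_class_id)
  {
      for (size_t i = 0; i < n; i++) {
          if (!memcmp(&drivers[i].class_id, offer_class_id,
                      sizeof(*offer_class_id)))
              return &drivers[i];
      }
      return NULL;
  }

An offer whose class ID matches no table entry returns NULL, which
corresponds to a synthetic device for which Linux has no driver.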

The device driver probe function opens the primary VMBus channel to
the corresponding VSP. It allocates guest memory for the channel
ring buffers and shares the ring buffer with the Hyper-V host by
giving the host a list of GPAs for the ring buffer memory.  See
vmbus_establish_gpadl().

Once the ring buffer is set up, the device driver and VSP exchange
setup messages via the primary channel.  These messages may include
negotiating the device protocol version to be used between the Linux
VSC and the VSP on the Hyper-V host.  The setup messages may also
include creating additional VMBus channels, which are somewhat
misnamed as "sub-channels" since they are functionally
equivalent to the primary channel once they are created.

Finally, the device driver may create entries in /dev as with
any device driver.

The Hyper-V host can send a "rescind" message to the guest to
remove a device that was previously offered. Linux drivers must
handle such a rescind message at any time. Rescinding a device
invokes the device driver "remove" function to cleanly shut
down the device and remove it. Once a synthetic device is
rescinded, neither Hyper-V nor Linux retains any state about
its previous existence. Such a device might be re-added later,
in which case it is treated as an entirely new device. See
vmbus_onoffer_rescind().