summaryrefslogtreecommitdiff
path: root/Documentation
diff options
context:
space:
mode:
Diffstat (limited to 'Documentation')
-rw-r--r--Documentation/00-INDEX2
-rw-r--r--Documentation/ABI/testing/sysfs-platform-ideapad-laptop17
-rw-r--r--Documentation/CodingStyle23
-rw-r--r--Documentation/SubmittingDrivers2
-rw-r--r--Documentation/SubmittingPatches2
-rw-r--r--Documentation/acpi/apei/einj.txt11
-rw-r--r--Documentation/devicetree/bindings/gpio/gpio_keys.txt2
-rw-r--r--Documentation/devicetree/bindings/input/fsl-mma8450.txt11
-rw-r--r--Documentation/email-clients.txt12
-rw-r--r--Documentation/fault-injection/fault-injection.txt3
-rw-r--r--Documentation/feature-removal-schedule.txt20
-rw-r--r--Documentation/filesystems/befs.txt2
-rw-r--r--Documentation/frv/booting.txt13
-rw-r--r--Documentation/ioctl/ioctl-number.txt1
-rw-r--r--Documentation/kernel-docs.txt11
-rw-r--r--Documentation/kernel-parameters.txt82
-rw-r--r--Documentation/m68k/kernel-options.txt14
-rw-r--r--Documentation/networking/bonding.txt31
-rw-r--r--Documentation/networking/scaling.txt371
-rw-r--r--Documentation/power/runtime_pm.txt10
-rw-r--r--Documentation/ramoops.txt76
-rw-r--r--Documentation/virtual/00-INDEX3
-rw-r--r--Documentation/virtual/lguest/lguest.c3
-rw-r--r--Documentation/virtual/virtio-spec.txt2200
24 files changed, 2842 insertions, 80 deletions
diff --git a/Documentation/00-INDEX b/Documentation/00-INDEX
index 1f89424c36a6..65bbd2622396 100644
--- a/Documentation/00-INDEX
+++ b/Documentation/00-INDEX
@@ -272,6 +272,8 @@ printk-formats.txt
- how to get printk format specifiers right
prio_tree.txt
- info on radix-priority-search-tree use for indexing vmas.
+ramoops.txt
+ - documentation of the ramoops oops/panic logging module.
rbtree.txt
- info on what red-black trees are and what they are for.
robust-futex-ABI.txt
diff --git a/Documentation/ABI/testing/sysfs-platform-ideapad-laptop b/Documentation/ABI/testing/sysfs-platform-ideapad-laptop
index 807fca2ae2a4..ff53183c3848 100644
--- a/Documentation/ABI/testing/sysfs-platform-ideapad-laptop
+++ b/Documentation/ABI/testing/sysfs-platform-ideapad-laptop
@@ -4,3 +4,20 @@ KernelVersion: 2.6.37
Contact: "Ike Panhc <ike.pan@canonical.com>"
Description:
Control the power of camera module. 1 means on, 0 means off.
+
+What: /sys/devices/platform/ideapad/cfg
+Date: Jun 2011
+KernelVersion: 3.1
+Contact: "Ike Panhc <ike.pan@canonical.com>"
+Description:
+ Ideapad capability bits.
+ Bit 8-10: 1 - Intel graphic only
+ 2 - ATI graphic only
+ 3 - Nvidia graphic only
+ 4 - Intel and ATI graphic
+ 5 - Intel and Nvidia graphic
+ Bit 16: Bluetooth exist (1 for exist)
+ Bit 17: 3G exist (1 for exist)
+ Bit 18: Wifi exist (1 for exist)
+ Bit 19: Camera exist (1 for exist)
+
diff --git a/Documentation/CodingStyle b/Documentation/CodingStyle
index fa6e25b94a54..c940239d9678 100644
--- a/Documentation/CodingStyle
+++ b/Documentation/CodingStyle
@@ -80,22 +80,13 @@ available tools.
The limit on the length of lines is 80 columns and this is a strongly
preferred limit.
-Statements longer than 80 columns will be broken into sensible chunks.
-Descendants are always substantially shorter than the parent and are placed
-substantially to the right. The same applies to function headers with a long
-argument list. Long strings are as well broken into shorter strings. The
-only exception to this is where exceeding 80 columns significantly increases
-readability and does not hide information.
-
-void fun(int a, int b, int c)
-{
- if (condition)
- printk(KERN_WARNING "Warning this is a long printk with "
- "3 parameters a: %u b: %u "
- "c: %u \n", a, b, c);
- else
- next_statement;
-}
+Statements longer than 80 columns will be broken into sensible chunks, unless
+exceeding 80 columns significantly increases readability and does not hide
+information. Descendants are always substantially shorter than the parent and
+are placed substantially to the right. The same applies to function headers
+with a long argument list. However, never break user-visible strings such as
+printk messages, because that breaks the ability to grep for them.
+
Chapter 3: Placing Braces and Spaces
diff --git a/Documentation/SubmittingDrivers b/Documentation/SubmittingDrivers
index 319baa8b60dd..36d16bbf72c6 100644
--- a/Documentation/SubmittingDrivers
+++ b/Documentation/SubmittingDrivers
@@ -130,7 +130,7 @@ Linux kernel master tree:
ftp.??.kernel.org:/pub/linux/kernel/...
?? == your country code, such as "us", "uk", "fr", etc.
- http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git
+ http://git.kernel.org/?p=linux/kernel/git/torvalds/linux.git
Linux kernel mailing list:
linux-kernel@vger.kernel.org
diff --git a/Documentation/SubmittingPatches b/Documentation/SubmittingPatches
index 569f3532e138..4468ce24427c 100644
--- a/Documentation/SubmittingPatches
+++ b/Documentation/SubmittingPatches
@@ -303,7 +303,7 @@ patches that are being emailed around.
The sign-off is a simple line at the end of the explanation for the
patch, which certifies that you wrote it or otherwise have the right to
-pass it on as a open-source patch. The rules are pretty simple: if you
+pass it on as an open-source patch. The rules are pretty simple: if you
can certify the below:
Developer's Certificate of Origin 1.1
diff --git a/Documentation/acpi/apei/einj.txt b/Documentation/acpi/apei/einj.txt
index dfab71848dc8..5cc699ba5453 100644
--- a/Documentation/acpi/apei/einj.txt
+++ b/Documentation/acpi/apei/einj.txt
@@ -48,12 +48,19 @@ directory apei/einj. The following files are provided.
- param1
This file is used to set the first error parameter value. Effect of
parameter depends on error_type specified. For memory error, this is
- physical memory address.
+ physical memory address. Only available if param_extension module
+ parameter is specified.
- param2
This file is used to set the second error parameter value. Effect of
parameter depends on error_type specified. For memory error, this is
- physical memory address mask.
+ physical memory address mask. Only available if param_extension
+ module parameter is specified.
+
+Injecting parameter support is a BIOS version specific extension, that
+is, it only works on some BIOS version. If you want to use it, please
+make sure your BIOS version has the proper support and specify
+"param_extension=y" in module parameter.
For more information about EINJ, please refer to ACPI specification
version 4.0, section 17.5.
diff --git a/Documentation/devicetree/bindings/gpio/gpio_keys.txt b/Documentation/devicetree/bindings/gpio/gpio_keys.txt
index 7190c99d7611..5c2c02140a62 100644
--- a/Documentation/devicetree/bindings/gpio/gpio_keys.txt
+++ b/Documentation/devicetree/bindings/gpio/gpio_keys.txt
@@ -10,7 +10,7 @@ Optional properties:
Each button (key) is represented as a sub-node of "gpio-keys":
Subnode properties:
- - gpios: OF devcie-tree gpio specificatin.
+ - gpios: OF device-tree gpio specification.
- label: Descriptive name of the key.
- linux,code: Keycode to emit.
diff --git a/Documentation/devicetree/bindings/input/fsl-mma8450.txt b/Documentation/devicetree/bindings/input/fsl-mma8450.txt
new file mode 100644
index 000000000000..a00c94ccbdee
--- /dev/null
+++ b/Documentation/devicetree/bindings/input/fsl-mma8450.txt
@@ -0,0 +1,11 @@
+* Freescale MMA8450 3-Axis Accelerometer
+
+Required properties:
+- compatible : "fsl,mma8450".
+
+Example:
+
+accelerometer: mma8450@1c {
+ compatible = "fsl,mma8450";
+ reg = <0x1c>;
+};
diff --git a/Documentation/email-clients.txt b/Documentation/email-clients.txt
index a0b58e29f911..860c29a472ad 100644
--- a/Documentation/email-clients.txt
+++ b/Documentation/email-clients.txt
@@ -199,18 +199,16 @@ to coerce it into behaving.
To beat some sense out of the internal editor, do this:
-- Under account settings, composition and addressing, uncheck "Compose
- messages in HTML format".
-
- Edit your Thunderbird config settings so that it won't use format=flowed.
Go to "edit->preferences->advanced->config editor" to bring up the
thunderbird's registry editor, and set "mailnews.send_plaintext_flowed" to
"false".
-- Enable "preformat" mode: Shft-click on the Write icon to bring up the HTML
- composer, select "Preformat" from the drop-down box just under the subject
- line, then close the message without saving. (This setting also applies to
- the text composer, but the only control for it is in the HTML composer.)
+- Disable HTML Format: Set "mail.identity.id1.compose_html" to "false".
+
+- Enable "preformat" mode: Set "editor.quotesPreformatted" to "true".
+
+- Enable UTF8: Set "prefs.converted-to-utf8" to "true".
- Install the "toggle wordwrap" extension. Download the file from:
https://addons.mozilla.org/thunderbird/addon/2351/
diff --git a/Documentation/fault-injection/fault-injection.txt b/Documentation/fault-injection/fault-injection.txt
index 7be15e44d481..82a5d250d75e 100644
--- a/Documentation/fault-injection/fault-injection.txt
+++ b/Documentation/fault-injection/fault-injection.txt
@@ -143,8 +143,7 @@ o provide a way to configure fault attributes
failslab, fail_page_alloc, and fail_make_request use this way.
Helper functions:
- init_fault_attr_dentries(entries, attr, name);
- void cleanup_fault_attr_dentries(entries);
+ fault_create_debugfs_attr(name, parent, attr);
- module parameters
diff --git a/Documentation/feature-removal-schedule.txt b/Documentation/feature-removal-schedule.txt
index ea0bace0124a..c4a6e148732a 100644
--- a/Documentation/feature-removal-schedule.txt
+++ b/Documentation/feature-removal-schedule.txt
@@ -296,15 +296,6 @@ Who: Ravikiran Thirumalai <kiran@scalex86.org>
---------------------------
-What: CONFIG_THERMAL_HWMON
-When: January 2009
-Why: This option was introduced just to allow older lm-sensors userspace
- to keep working over the upgrade to 2.6.26. At the scheduled time of
- removal fixed lm-sensors (2.x or 3.x) should be readily available.
-Who: Rene Herman <rene.herman@gmail.com>
-
----------------------------
-
What: Code that is now under CONFIG_WIRELESS_EXT_SYSFS
(in net/core/net-sysfs.c)
When: After the only user (hal) has seen a release with the patches
@@ -590,3 +581,14 @@ Why: This driver has been superseded by g_mass_storage.
Who: Alan Stern <stern@rowland.harvard.edu>
----------------------------
+
+What: threeg and interface sysfs files in /sys/devices/platform/acer-wmi
+When: 2012
+Why: In 3.0, we can now autodetect internal 3G device and already have
+ the threeg rfkill device. So, we plan to remove threeg sysfs support
+ for it's no longer necessary.
+
+ We also plan to remove interface sysfs file that exposed which ACPI-WMI
+ interface that was used by acer-wmi driver. It will replaced by
+ information log when acer-wmi initial.
+Who: Lee, Chun-Yi <jlee@novell.com>
diff --git a/Documentation/filesystems/befs.txt b/Documentation/filesystems/befs.txt
index 6e49c363938e..da45e6c842b8 100644
--- a/Documentation/filesystems/befs.txt
+++ b/Documentation/filesystems/befs.txt
@@ -27,7 +27,7 @@ His original code can still be found at:
Does anyone know of a more current email address for Makoto? He doesn't
respond to the address given above...
-Current maintainer: Sergey S. Kostyliov <rathamahata@php4.ru>
+This filesystem doesn't have a maintainer.
WHAT IS THIS DRIVER?
==================
diff --git a/Documentation/frv/booting.txt b/Documentation/frv/booting.txt
index ace200b7c214..37c4d84a0e57 100644
--- a/Documentation/frv/booting.txt
+++ b/Documentation/frv/booting.txt
@@ -106,13 +106,20 @@ separated by spaces:
To use the first on-chip serial port at baud rate 115200, no parity, 8
bits, and no flow control.
- (*) root=/dev/<xxxx>
+ (*) root=<xxxx>
- This specifies the device upon which the root filesystem resides. For
- example:
+ This specifies the device upon which the root filesystem resides. It
+ may be specified by major and minor number, device path, or even
+ partition uuid, if supported. For example:
/dev/nfs NFS root filesystem
/dev/mtdblock3 Fourth RedBoot partition on the System Flash
+ PARTUUID=00112233-4455-6677-8899-AABBCCDDEEFF/PARTNROFF=1
+ first partition after the partition with the given UUID
+ 253:0 Device with major 253 and minor 0
+
+ Authoritative information can be found in
+ "Documentation/kernel-parameters.txt".
(*) rw
diff --git a/Documentation/ioctl/ioctl-number.txt b/Documentation/ioctl/ioctl-number.txt
index 72ba8d51dbc1..845a191004b1 100644
--- a/Documentation/ioctl/ioctl-number.txt
+++ b/Documentation/ioctl/ioctl-number.txt
@@ -292,6 +292,7 @@ Code Seq#(hex) Include File Comments
<mailto:buk@buks.ipn.de>
0xA0 all linux/sdp/sdp.h Industrial Device Project
<mailto:kenji@bitgate.com>
+0xA2 00-0F arch/tile/include/asm/hardwall.h
0xA3 80-8F Port ACL in development:
<mailto:tlewis@mindspring.com>
0xA3 90-9F linux/dtlk.h
diff --git a/Documentation/kernel-docs.txt b/Documentation/kernel-docs.txt
index 9a8674629a07..0e0734b509d8 100644
--- a/Documentation/kernel-docs.txt
+++ b/Documentation/kernel-docs.txt
@@ -620,17 +620,6 @@
(including this document itself) have been moved there, and might
be more up to date than the web version.
- * Name: "Linux Source Driver"
- URL: http://lsd.linux.cz
- Keywords: Browsing source code.
- Description: "Linux Source Driver (LSD) is an application, which
- can make browsing source codes of Linux kernel easier than you can
- imagine. You can select between multiple versions of kernel (e.g.
- 0.01, 1.0.0, 2.0.33, 2.0.34pre13, 2.0.0, 2.1.101 etc.). With LSD
- you can search Linux kernel (fulltext, macros, types, functions
- and variables) and LSD can generate patches for you on the fly
- (files, directories or kernel)".
-
* Name: "Linux Kernel Source Reference"
Author: Thomas Graichen.
URL: http://marc.info/?l=linux-kernel&m=96446640102205&w=4
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 26a83743af19..6ca1f5cb71e0 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -40,6 +40,7 @@ parameter is applicable:
ALSA ALSA sound support is enabled.
APIC APIC support is enabled.
APM Advanced Power Management support is enabled.
+ ARM ARM architecture is enabled.
AVR32 AVR32 architecture is enabled.
AX25 Appropriate AX.25 support is enabled.
BLACKFIN Blackfin architecture is enabled.
@@ -49,6 +50,7 @@ parameter is applicable:
EFI EFI Partitioning (GPT) is enabled
EIDE EIDE/ATAPI support is enabled.
FB The frame buffer device is enabled.
+ FTRACE Function tracing enabled.
GCOV GCOV profiling is enabled.
HW Appropriate hardware is enabled.
IA-64 IA-64 architecture is enabled.
@@ -69,6 +71,7 @@ parameter is applicable:
Documentation/m68k/kernel-options.txt.
MCA MCA bus support is enabled.
MDA MDA console support is enabled.
+ MIPS MIPS architecture is enabled.
MOUSE Appropriate mouse support is enabled.
MSI Message Signaled Interrupts (PCI).
MTD MTD (Memory Technology Device) support is enabled.
@@ -100,7 +103,6 @@ parameter is applicable:
SPARC Sparc architecture is enabled.
SWSUSP Software suspend (hibernation) is enabled.
SUSPEND System suspend states are enabled.
- FTRACE Function tracing enabled.
TPM TPM drivers are enabled.
TS Appropriate touchscreen support is enabled.
UMS USB Mass Storage support is enabled.
@@ -115,7 +117,7 @@ parameter is applicable:
X86-64 X86-64 architecture is enabled.
More X86-64 boot options can be found in
Documentation/x86/x86_64/boot-options.txt .
- X86 Either 32bit or 64bit x86 (same as X86-32+X86-64)
+ X86 Either 32-bit or 64-bit x86 (same as X86-32+X86-64)
XEN Xen support is enabled
In addition, the following text indicates that the option:
@@ -163,6 +165,11 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
See also Documentation/power/pm.txt, pci=noacpi
+ acpi_rsdp= [ACPI,EFI,KEXEC]
+ Pass the RSDP address to the kernel, mostly used
+ on machines running EFI runtime service to boot the
+ second kernel for kdump.
+
acpi_apic_instance= [ACPI, IOAPIC]
Format: <int>
2: use 2nd APIC table, if available
@@ -371,7 +378,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
atkbd.softrepeat= [HW]
Use software keyboard repeat
- autotest [IA64]
+ autotest [IA-64]
baycom_epp= [HW,AX25]
Format: <io>,<mode>
@@ -546,6 +553,9 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
/proc/<pid>/coredump_filter.
See also Documentation/filesystems/proc.txt.
+ cpuidle.off=1 [CPU_IDLE]
+ disable the cpuidle sub-system
+
cpcihp_generic= [HW,PCI] Generic port I/O CompactPCI driver
Format:
<first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>]
@@ -673,8 +683,8 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
uart[8250],mmio32,<addr>[,options]
Start an early, polled-mode console on the 8250/16550
UART at the specified I/O port or MMIO address.
- MMIO inter-register address stride is either 8bit (mmio)
- or 32bit (mmio32).
+ MMIO inter-register address stride is either 8-bit
+ (mmio) or 32-bit (mmio32).
The options are the same as for ttyS, above.
earlyprintk= [X86,SH,BLACKFIN]
@@ -717,7 +727,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
See Documentation/block/as-iosched.txt and
Documentation/block/deadline-iosched.txt for details.
- elfcorehdr= [IA64,PPC,SH,X86]
+ elfcorehdr= [IA-64,PPC,SH,X86]
Specifies physical address of start of kernel core
image elf header. Generally kexec loader will
pass this option to capture kernel.
@@ -783,7 +793,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
tracer at boot up. function-list is a comma separated
list of functions. This list can be changed at run
time by the set_ftrace_filter file in the debugfs
- tracing directory.
+ tracing directory.
ftrace_notrace=[function-list]
[FTRACE] Do not trace the functions specified in
@@ -821,7 +831,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
hashdist= [KNL,NUMA] Large hashes allocated during boot
are distributed across NUMA nodes. Defaults on
- for 64bit NUMA, off otherwise.
+ for 64-bit NUMA, off otherwise.
Format: 0 | 1 (for off | on)
hcl= [IA-64] SGI's Hardware Graph compatibility layer
@@ -990,10 +1000,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
DMA.
forcedac [x86_64]
With this option iommu will not optimize to look
- for io virtual address below 32 bit forcing dual
+ for io virtual address below 32-bit forcing dual
address cycle on pci bus for cards supporting greater
- than 32 bit addressing. The default is to look
- for translation below 32 bit and if not available
+ than 32-bit addressing. The default is to look
+ for translation below 32-bit and if not available
then look in the higher range.
strict [Default Off]
With this option on every unmap_single operation will
@@ -1009,7 +1019,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
off disable Interrupt Remapping
nosid disable Source ID checking
- inttest= [IA64]
+ inttest= [IA-64]
iomem= Disable strict checking of access to MMIO memory
strict regions from userspace.
@@ -1026,7 +1036,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
nomerge
forcesac
soft
- pt [x86, IA64]
+ pt [x86, IA-64]
io7= [HW] IO7 for Marvel based alpha systems
See comment before marvel_specify_io7 in
@@ -1157,7 +1167,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
kvm-amd.npt= [KVM,AMD] Disable nested paging (virtualized MMU)
for all guests.
- Default is 1 (enabled) if in 64bit or 32bit-PAE mode
+ Default is 1 (enabled) if in 64-bit or 32-bit PAE mode.
kvm-intel.ept= [KVM,Intel] Disable extended page tables
(virtualized MMU) support on capable Intel chips.
@@ -1194,10 +1204,10 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
libata.dma=0 Disable all PATA and SATA DMA
libata.dma=1 PATA and SATA Disk DMA only
libata.dma=2 ATAPI (CDROM) DMA only
- libata.dma=4 Compact Flash DMA only
+ libata.dma=4 Compact Flash DMA only
Combinations also work, so libata.dma=3 enables DMA
for disks and CDROMs, but not CFs.
-
+
libata.ignore_hpa= [LIBATA] Ignore HPA limit
libata.ignore_hpa=0 keep BIOS limits (default)
libata.ignore_hpa=1 ignore limits, using full disk
@@ -1323,7 +1333,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
ltpc= [NET]
Format: <io>,<irq>,<dma>
- machvec= [IA64] Force the use of a particular machine-vector
+ machvec= [IA-64] Force the use of a particular machine-vector
(machvec) in a generic kernel.
Example: machvec=hpzx1_swiotlb
@@ -1726,7 +1736,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
nointroute [IA-64]
- nojitter [IA64] Disables jitter checking for ITC timers.
+ nojitter [IA-64] Disables jitter checking for ITC timers.
no-kvmclock [X86,KVM] Disable paravirtualized KVM clock driver
@@ -1792,7 +1802,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
nox2apic [X86-64,APIC] Do not enable x2APIC mode.
- nptcg= [IA64] Override max number of concurrent global TLB
+ nptcg= [IA-64] Override max number of concurrent global TLB
purges which is reported from either PAL_VM_SUMMARY or
SAL PALO.
@@ -2069,7 +2079,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
Format: { parport<nr> | timid | 0 }
See also Documentation/parport.txt.
- pmtmr= [X86] Manual setup of pmtmr I/O Port.
+ pmtmr= [X86] Manual setup of pmtmr I/O Port.
Override pmtimer IOPort with a hex value.
e.g. pmtmr=0x508
@@ -2240,6 +2250,7 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
ro [KNL] Mount root device read-only on boot
root= [KNL] Root filesystem
+ See name_to_dev_t comment in init/do_mounts.c.
rootdelay= [KNL] Delay (in seconds) to pause before attempting to
mount the root filesystem
@@ -2626,6 +2637,16 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
medium is write-protected).
Example: quirks=0419:aaf5:rl,0421:0433:rc
+ user_debug= [KNL,ARM]
+ Format: <int>
+ See arch/arm/Kconfig.debug help text.
+ 1 - undefined instruction events
+ 2 - system calls
+ 4 - invalid data aborts
+ 8 - SIGSEGV faults
+ 16 - SIGBUS faults
+ Example: user_debug=31
+
userpte=
[X86] Flags controlling user PTE allocations.
@@ -2671,6 +2692,27 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
vmpoff= [KNL,S390] Perform z/VM CP command after power off.
Format: <command>
+ vsyscall= [X86-64]
+ Controls the behavior of vsyscalls (i.e. calls to
+ fixed addresses of 0xffffffffff600x00 from legacy
+ code). Most statically-linked binaries and older
+ versions of glibc use these calls. Because these
+ functions are at fixed addresses, they make nice
+ targets for exploits that can control RIP.
+
+ emulate [default] Vsyscalls turn into traps and are
+ emulated reasonably safely.
+
+ native Vsyscalls are native syscall instructions.
+ This is a little bit faster than trapping
+ and makes a few dynamic recompilers work
+ better than they would in emulation mode.
+ It also makes exploits much easier to write.
+
+ none Vsyscalls don't work at all. This makes
+ them quite hard to use for exploits but
+ might break your system.
+
vt.cur_default= [VT] Default cursor shape.
Format: 0xCCBBAA, where AA, BB, and CC are the same as
the parameters of the <Esc>[?A;B;Cc escape sequence;
diff --git a/Documentation/m68k/kernel-options.txt b/Documentation/m68k/kernel-options.txt
index c93bed66e25d..97d45f276fe6 100644
--- a/Documentation/m68k/kernel-options.txt
+++ b/Documentation/m68k/kernel-options.txt
@@ -129,6 +129,20 @@ decimal 11 is the major of SCSI CD-ROMs, and the minor 0 stands for
the first of these. You can find out all valid major numbers by
looking into include/linux/major.h.
+In addition to major and minor numbers, if the device containing your
+root partition uses a partition table format with unique partition
+identifiers, then you may use them. For instance,
+"root=PARTUUID=00112233-4455-6677-8899-AABBCCDDEEFF". It is also
+possible to reference another partition on the same device using a
+known partition UUID as the starting point. For example,
+if partition 5 of the device has the UUID of
+00112233-4455-6677-8899-AABBCCDDEEFF then partition 3 may be found as
+follows:
+ PARTUUID=00112233-4455-6677-8899-AABBCCDDEEFF/PARTNROFF=-2
+
+Authoritative information can be found in
+"Documentation/kernel-parameters.txt".
+
2.2) ro, rw
-----------
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index 675612ff41ae..91df678fb7f8 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -238,6 +238,18 @@ ad_select
This option was added in bonding version 3.4.0.
+all_slaves_active
+
+ Specifies that duplicate frames (received on inactive ports) should be
+ dropped (0) or delivered (1).
+
+ Normally, bonding will drop duplicate frames (received on inactive
+ ports), which is desirable for most users. But there are some times
+ it is nice to allow duplicate frames to be delivered.
+
+ The default value is 0 (drop duplicate frames received on inactive
+ ports).
+
arp_interval
Specifies the ARP link monitoring frequency in milliseconds.
@@ -433,6 +445,23 @@ miimon
determined. See the High Availability section for additional
information. The default value is 0.
+min_links
+
+ Specifies the minimum number of links that must be active before
+ asserting carrier. It is similar to the Cisco EtherChannel min-links
+ feature. This allows setting the minimum number of member ports that
+ must be up (link-up state) before marking the bond device as up
+ (carrier on). This is useful for situations where higher level services
+ such as clustering want to ensure a minimum number of low bandwidth
+ links are active before switchover. This option only affect 802.3ad
+ mode.
+
+ The default value is 0. This will cause carrier to be asserted (for
+ 802.3ad mode) whenever there is an active aggregator, regardless of the
+ number of available links in that aggregator. Note that, because an
+ aggregator cannot be active without at least one available link,
+ setting this option to 0 or to 1 has the exact same effect.
+
mode
Specifies one of the bonding policies. The default is
@@ -599,7 +628,7 @@ num_unsol_na
affect only the active-backup mode. These options were added for
bonding versions 3.3.0 and 3.4.0 respectively.
- From Linux 2.6.40 and bonding version 3.7.1, these notifications
+ From Linux 3.0 and bonding version 3.7.1, these notifications
are generated by the ipv4 and ipv6 code and the numbers of
repetitions cannot be set independently.
diff --git a/Documentation/networking/scaling.txt b/Documentation/networking/scaling.txt
new file mode 100644
index 000000000000..7254b4b5910e
--- /dev/null
+++ b/Documentation/networking/scaling.txt
@@ -0,0 +1,371 @@
+Scaling in the Linux Networking Stack
+
+
+Introduction
+============
+
+This document describes a set of complementary techniques in the Linux
+networking stack to increase parallelism and improve performance for
+multi-processor systems.
+
+The following technologies are described:
+
+ RSS: Receive Side Scaling
+ RPS: Receive Packet Steering
+ RFS: Receive Flow Steering
+ Accelerated Receive Flow Steering
+ XPS: Transmit Packet Steering
+
+
+RSS: Receive Side Scaling
+=========================
+
+Contemporary NICs support multiple receive and transmit descriptor queues
+(multi-queue). On reception, a NIC can send different packets to different
+queues to distribute processing among CPUs. The NIC distributes packets by
+applying a filter to each packet that assigns it to one of a small number
+of logical flows. Packets for each flow are steered to a separate receive
+queue, which in turn can be processed by separate CPUs. This mechanism is
+generally known as “Receive-side Scaling” (RSS). The goal of RSS and
+the other scaling techniques to increase performance uniformly.
+Multi-queue distribution can also be used for traffic prioritization, but
+that is not the focus of these techniques.
+
+The filter used in RSS is typically a hash function over the network
+and/or transport layer headers-- for example, a 4-tuple hash over
+IP addresses and TCP ports of a packet. The most common hardware
+implementation of RSS uses a 128-entry indirection table where each entry
+stores a queue number. The receive queue for a packet is determined
+by masking out the low order seven bits of the computed hash for the
+packet (usually a Toeplitz hash), taking this number as a key into the
+indirection table and reading the corresponding value.
+
+Some advanced NICs allow steering packets to queues based on
+programmable filters. For example, webserver bound TCP port 80 packets
+can be directed to their own receive queue. Such “n-tuple” filters can
+be configured from ethtool (--config-ntuple).
+
+==== RSS Configuration
+
+The driver for a multi-queue capable NIC typically provides a kernel
+module parameter for specifying the number of hardware queues to
+configure. In the bnx2x driver, for instance, this parameter is called
+num_queues. A typical RSS configuration would be to have one receive queue
+for each CPU if the device supports enough queues, or otherwise at least
+one for each cache domain at a particular cache level (L1, L2, etc.).
+
+The indirection table of an RSS device, which resolves a queue by masked
+hash, is usually programmed by the driver at initialization. The
+default mapping is to distribute the queues evenly in the table, but the
+indirection table can be retrieved and modified at runtime using ethtool
+commands (--show-rxfh-indir and --set-rxfh-indir). Modifying the
+indirection table could be done to give different queues different
+relative weights.
+
+== RSS IRQ Configuration
+
+Each receive queue has a separate IRQ associated with it. The NIC triggers
+this to notify a CPU when new packets arrive on the given queue. The
+signaling path for PCIe devices uses message signaled interrupts (MSI-X),
+that can route each interrupt to a particular CPU. The active mapping
+of queues to IRQs can be determined from /proc/interrupts. By default,
+an IRQ may be handled on any CPU. Because a non-negligible part of packet
+processing takes place in receive interrupt handling, it is advantageous
+to spread receive interrupts between CPUs. To manually adjust the IRQ
+affinity of each interrupt see Documentation/IRQ-affinity. Some systems
+will be running irqbalance, a daemon that dynamically optimizes IRQ
+assignments and as a result may override any manual settings.
+
+== Suggested Configuration
+
+RSS should be enabled when latency is a concern or whenever receive
+interrupt processing forms a bottleneck. Spreading load between CPUs
+decreases queue length. For low latency networking, the optimal setting
+is to allocate as many queues as there are CPUs in the system (or the
+NIC maximum, if lower). Because the aggregate number of interrupts grows
+with each additional queue, the most efficient high-rate configuration
+is likely the one with the smallest number of receive queues where no
+CPU that processes receive interrupts reaches 100% utilization. Per-cpu
+load can be observed using the mpstat utility.
+
+
+RPS: Receive Packet Steering
+============================
+
+Receive Packet Steering (RPS) is logically a software implementation of
+RSS. Being in software, it is necessarily called later in the datapath.
+Whereas RSS selects the queue and hence CPU that will run the hardware
+interrupt handler, RPS selects the CPU to perform protocol processing
+above the interrupt handler. This is accomplished by placing the packet
+on the desired CPU’s backlog queue and waking up the CPU for processing.
+RPS has some advantages over RSS: 1) it can be used with any NIC,
+2) software filters can easily be added to hash over new protocols,
+3) it does not increase hardware device interrupt rate (although it does
+introduce inter-processor interrupts (IPIs)).
+
+RPS is called during bottom half of the receive interrupt handler, when
+a driver sends a packet up the network stack with netif_rx() or
+netif_receive_skb(). These call the get_rps_cpu() function, which
+selects the queue that should process a packet.
+
+The first step in determining the target CPU for RPS is to calculate a
+flow hash over the packet’s addresses or ports (2-tuple or 4-tuple hash
+depending on the protocol). This serves as a consistent hash of the
+associated flow of the packet. The hash is either provided by hardware
+or will be computed in the stack. Capable hardware can pass the hash in
+the receive descriptor for the packet; this would usually be the same
+hash used for RSS (e.g. computed Toeplitz hash). The hash is saved in
+skb->rx_hash and can be used elsewhere in the stack as a hash of the
+packet’s flow.
+
+Each receive hardware queue has an associated list of CPUs to which
+RPS may enqueue packets for processing. For each received packet,
+an index into the list is computed from the flow hash modulo the size
+of the list. The indexed CPU is the target for processing the packet,
+and the packet is queued to the tail of that CPU’s backlog queue. At
+the end of the bottom half routine, IPIs are sent to any CPUs for which
+packets have been queued to their backlog queue. The IPI wakes backlog
+processing on the remote CPU, and any queued packets are then processed
+up the networking stack.
+
+==== RPS Configuration
+
+RPS requires a kernel compiled with the CONFIG_RPS kconfig symbol (on
+by default for SMP). Even when compiled in, RPS remains disabled until
+explicitly configured. The list of CPUs to which RPS may forward traffic
+can be configured for each receive queue using a sysfs file entry:
+
+ /sys/class/net/<dev>/queues/rx-<n>/rps_cpus
+
+This file implements a bitmap of CPUs. RPS is disabled when it is zero
+(the default), in which case packets are processed on the interrupting
+CPU. Documentation/IRQ-affinity.txt explains how CPUs are assigned to
+the bitmap.
+
+== Suggested Configuration
+
+For a single queue device, a typical RPS configuration would be to set
+the rps_cpus to the CPUs in the same cache domain of the interrupting
+CPU. If NUMA locality is not an issue, this could also be all CPUs in
+the system. At high interrupt rate, it might be wise to exclude the
+interrupting CPU from the map since that already performs much work.
+
+For a multi-queue system, if RSS is configured so that a hardware
+receive queue is mapped to each CPU, then RPS is probably redundant
+and unnecessary. If there are fewer hardware queues than CPUs, then
+RPS might be beneficial if the rps_cpus for each queue are the ones that
+share the same cache domain as the interrupting CPU for that queue.
+
+
+RFS: Receive Flow Steering
+==========================
+
+While RPS steers packets solely based on hash, and thus generally
+provides good load distribution, it does not take into account
+application locality. This is accomplished by Receive Flow Steering
+(RFS). The goal of RFS is to increase datacache hitrate by steering
+kernel processing of packets to the CPU where the application thread
+consuming the packet is running. RFS relies on the same RPS mechanisms
+to enqueue packets onto the backlog of another CPU and to wake up that
+CPU.
+
+In RFS, packets are not forwarded directly by the value of their hash,
+but the hash is used as index into a flow lookup table. This table maps
+flows to the CPUs where those flows are being processed. The flow hash
+(see RPS section above) is used to calculate the index into this table.
+The CPU recorded in each entry is the one which last processed the flow.
+If an entry does not hold a valid CPU, then packets mapped to that entry
+are steered using plain RPS. Multiple table entries may point to the
+same CPU. Indeed, with many flows and few CPUs, it is very likely that
+a single application thread handles flows with many different flow hashes.
+
+rps_sock_table is a global flow table that contains the *desired* CPU for
+flows: the CPU that is currently processing the flow in userspace. Each
+table value is a CPU index that is updated during calls to recvmsg and
+sendmsg (specifically, inet_recvmsg(), inet_sendmsg(), inet_sendpage()
+and tcp_splice_read()).
+
+When the scheduler moves a thread to a new CPU while it has outstanding
+receive packets on the old CPU, packets may arrive out of order. To
+avoid this, RFS uses a second flow table to track outstanding packets
+for each flow: rps_dev_flow_table is a table specific to each hardware
+receive queue of each device. Each table value stores a CPU index and a
+counter. The CPU index represents the *current* CPU onto which packets
+for this flow are enqueued for further kernel processing. Ideally, kernel
+and userspace processing occur on the same CPU, and hence the CPU index
+in both tables is identical. This is likely false if the scheduler has
+recently migrated a userspace thread while the kernel still has packets
+enqueued for kernel processing on the old CPU.
+
+The counter in rps_dev_flow_table values records the length of the current
+CPU's backlog when a packet in this flow was last enqueued. Each backlog
+queue has a head counter that is incremented on dequeue. A tail counter
+is computed as head counter + queue length. In other words, the counter
+in rps_dev_flow_table[i] records the last element in flow i that has
+been enqueued onto the currently designated CPU for flow i (of course,
+entry i is actually selected by hash and multiple flows may hash to the
+same entry i).
+
+And now the trick for avoiding out of order packets: when selecting the
+CPU for packet processing (from get_rps_cpu()) the rps_sock_flow table
+and the rps_dev_flow table of the queue that the packet was received on
+are compared. If the desired CPU for the flow (found in the
+rps_sock_flow table) matches the current CPU (found in the rps_dev_flow
+table), the packet is enqueued onto that CPU’s backlog. If they differ,
+the current CPU is updated to match the desired CPU if one of the
+following is true:
+
+- The current CPU's queue head counter >= the recorded tail counter
+ value in rps_dev_flow[i]
+- The current CPU is unset (equal to NR_CPUS)
+- The current CPU is offline
+
+After this check, the packet is sent to the (possibly updated) current
+CPU. These rules aim to ensure that a flow only moves to a new CPU when
+there are no packets outstanding on the old CPU, as the outstanding
+packets could arrive later than those about to be processed on the new
+CPU.
+
+==== RFS Configuration
+
+RFS is only available if the kconfig symbol CONFIG_RFS is enabled (on
+by default for SMP). The functionality remains disabled until explicitly
+configured. The number of entries in the global flow table is set through:
+
+ /proc/sys/net/core/rps_sock_flow_entries
+
+The number of entries in the per-queue flow table are set through:
+
+ /sys/class/net/<dev>/queues/tx-<n>/rps_flow_cnt
+
+== Suggested Configuration
+
+Both of these need to be set before RFS is enabled for a receive queue.
+Values for both are rounded up to the nearest power of two. The
+suggested flow count depends on the expected number of active connections
+at any given time, which may be significantly less than the number of open
+connections. We have found that a value of 32768 for rps_sock_flow_entries
+works fairly well on a moderately loaded server.
+
+For a single queue device, the rps_flow_cnt value for the single queue
+would normally be configured to the same value as rps_sock_flow_entries.
+For a multi-queue device, the rps_flow_cnt for each queue might be
+configured as rps_sock_flow_entries / N, where N is the number of
+queues. So for instance, if rps_flow_entries is set to 32768 and there
+are 16 configured receive queues, rps_flow_cnt for each queue might be
+configured as 2048.
+
+
+Accelerated RFS
+===============
+
+Accelerated RFS is to RFS what RSS is to RPS: a hardware-accelerated load
+balancing mechanism that uses soft state to steer flows based on where
+the application thread consuming the packets of each flow is running.
+Accelerated RFS should perform better than RFS since packets are sent
+directly to a CPU local to the thread consuming the data. The target CPU
+will either be the same CPU where the application runs, or at least a CPU
+which is local to the application thread’s CPU in the cache hierarchy.
+
+To enable accelerated RFS, the networking stack calls the
+ndo_rx_flow_steer driver function to communicate the desired hardware
+queue for packets matching a particular flow. The network stack
+automatically calls this function every time a flow entry in
+rps_dev_flow_table is updated. The driver in turn uses a device specific
+method to program the NIC to steer the packets.
+
+The hardware queue for a flow is derived from the CPU recorded in
+rps_dev_flow_table. The stack consults a CPU to hardware queue map which
+is maintained by the NIC driver. This is an auto-generated reverse map of
+the IRQ affinity table shown by /proc/interrupts. Drivers can use
+functions in the cpu_rmap (“CPU affinity reverse map”) kernel library
+to populate the map. For each CPU, the corresponding queue in the map is
+set to be one whose processing CPU is closest in cache locality.
+
+==== Accelerated RFS Configuration
+
+Accelerated RFS is only available if the kernel is compiled with
+CONFIG_RFS_ACCEL and support is provided by the NIC device and driver.
+It also requires that ntuple filtering is enabled via ethtool. The map
+of CPU to queues is automatically deduced from the IRQ affinities
+configured for each receive queue by the driver, so no additional
+configuration should be necessary.
+
+== Suggested Configuration
+
+This technique should be enabled whenever one wants to use RFS and the
+NIC supports hardware acceleration.
+
+XPS: Transmit Packet Steering
+=============================
+
+Transmit Packet Steering is a mechanism for intelligently selecting
+which transmit queue to use when transmitting a packet on a multi-queue
+device. To accomplish this, a mapping from CPU to hardware queue(s) is
+recorded. The goal of this mapping is usually to assign queues
+exclusively to a subset of CPUs, where the transmit completions for
+these queues are processed on a CPU within this set. This choice
+provides two benefits. First, contention on the device queue lock is
+significantly reduced since fewer CPUs contend for the same queue
+(contention can be eliminated completely if each CPU has its own
+transmit queue). Secondly, cache miss rate on transmit completion is
+reduced, in particular for data cache lines that hold the sk_buff
+structures.
+
+XPS is configured per transmit queue by setting a bitmap of CPUs that
+may use that queue to transmit. The reverse mapping, from CPUs to
+transmit queues, is computed and maintained for each network device.
+When transmitting the first packet in a flow, the function
+get_xps_queue() is called to select a queue. This function uses the ID
+of the running CPU as a key into the CPU-to-queue lookup table. If the
+ID matches a single queue, that is used for transmission. If multiple
+queues match, one is selected by using the flow hash to compute an index
+into the set.
+
+The queue chosen for transmitting a particular flow is saved in the
+corresponding socket structure for the flow (e.g. a TCP connection).
+This transmit queue is used for subsequent packets sent on the flow to
+prevent out of order (ooo) packets. The choice also amortizes the cost
+of calling get_xps_queues() over all packets in the connection. To avoid
+ooo packets, the queue for a flow can subsequently only be changed if
+skb->ooo_okay is set for a packet in the flow. This flag indicates that
+there are no outstanding packets in the flow, so the transmit queue can
+change without the risk of generating out of order packets. The
+transport layer is responsible for setting ooo_okay appropriately. TCP,
+for instance, sets the flag when all data for a connection has been
+acknowledged.
+
+==== XPS Configuration
+
+XPS is only available if the kconfig symbol CONFIG_XPS is enabled (on by
+default for SMP). The functionality remains disabled until explicitly
+configured. To enable XPS, the bitmap of CPUs that may use a transmit
+queue is configured using the sysfs file entry:
+
+/sys/class/net/<dev>/queues/tx-<n>/xps_cpus
+
+== Suggested Configuration
+
+For a network device with a single transmission queue, XPS configuration
+has no effect, since there is no choice in this case. In a multi-queue
+system, XPS is preferably configured so that each CPU maps onto one queue.
+If there are as many queues as there are CPUs in the system, then each
+queue can also map onto one CPU, resulting in exclusive pairings that
+experience no contention. If there are fewer queues than CPUs, then the
+best CPUs to share a given queue are probably those that share the cache
+with the CPU that processes transmit completions for that queue
+(transmit interrupts).
+
+
+Further Information
+===================
+RPS and RFS were introduced in kernel 2.6.35. XPS was incorporated into
+2.6.38. Original patches were submitted by Tom Herbert
+(therbert@google.com)
+
+Accelerated RFS was introduced in 2.6.35. Original patches were
+submitted by Ben Hutchings (bhutchings@solarflare.com)
+
+Authors:
+Tom Herbert (therbert@google.com)
+Willem de Bruijn (willemb@google.com)
diff --git a/Documentation/power/runtime_pm.txt b/Documentation/power/runtime_pm.txt
index 14dd3c6ad97e..4ce5450ab6e8 100644
--- a/Documentation/power/runtime_pm.txt
+++ b/Documentation/power/runtime_pm.txt
@@ -54,11 +54,10 @@ referred to as subsystem-level callbacks in what follows.
By default, the callbacks are always invoked in process context with interrupts
enabled. However, subsystems can use the pm_runtime_irq_safe() helper function
to tell the PM core that a device's ->runtime_suspend() and ->runtime_resume()
-callbacks should be invoked in atomic context with interrupts disabled
-(->runtime_idle() is still invoked the default way). This implies that these
-callback routines must not block or sleep, but it also means that the
-synchronous helper functions listed at the end of Section 4 can be used within
-an interrupt handler or in an atomic context.
+callbacks should be invoked in atomic context with interrupts disabled.
+This implies that these callback routines must not block or sleep, but it also
+means that the synchronous helper functions listed at the end of Section 4 can
+be used within an interrupt handler or in an atomic context.
The subsystem-level suspend callback is _entirely_ _responsible_ for handling
the suspend of the device as appropriate, which may, but need not include
@@ -483,6 +482,7 @@ pm_runtime_suspend()
pm_runtime_autosuspend()
pm_runtime_resume()
pm_runtime_get_sync()
+pm_runtime_put_sync()
pm_runtime_put_sync_suspend()
5. Runtime PM Initialization, Device Probing and Removal
diff --git a/Documentation/ramoops.txt b/Documentation/ramoops.txt
new file mode 100644
index 000000000000..8fb1ba7fe7bf
--- /dev/null
+++ b/Documentation/ramoops.txt
@@ -0,0 +1,76 @@
+Ramoops oops/panic logger
+=========================
+
+Sergiu Iordache <sergiu@chromium.org>
+
+Updated: 8 August 2011
+
+0. Introduction
+
+Ramoops is an oops/panic logger that writes its logs to RAM before the system
+crashes. It works by logging oopses and panics in a circular buffer. Ramoops
+needs a system with persistent RAM so that the content of that area can
+survive after a restart.
+
+1. Ramoops concepts
+
+Ramoops uses a predefined memory area to store the dump. The start and size of
+the memory area are set using two variables:
+ * "mem_address" for the start
+ * "mem_size" for the size. The memory size will be rounded down to a
+ power of two.
+
+The memory area is divided into "record_size" chunks (also rounded down to
+power of two) and each oops/panic writes a "record_size" chunk of
+information.
+
+Dumping both oopses and panics can be done by setting 1 in the "dump_oops"
+variable while setting 0 in that variable dumps only the panics.
+
+The module uses a counter to record multiple dumps but the counter gets reset
+on restart (i.e. new dumps after the restart will overwrite old ones).
+
+2. Setting the parameters
+
+Setting the ramoops parameters can be done in 2 different manners:
+ 1. Use the module parameters (which have the names of the variables described
+ as before).
+ 2. Use a platform device and set the platform data. The parameters can then
+ be set through that platform data. An example of doing that is:
+
+#include <linux/ramoops.h>
+[...]
+
+static struct ramoops_platform_data ramoops_data = {
+ .mem_size = <...>,
+ .mem_address = <...>,
+ .record_size = <...>,
+ .dump_oops = <...>,
+};
+
+static struct platform_device ramoops_dev = {
+ .name = "ramoops",
+ .dev = {
+ .platform_data = &ramoops_data,
+ },
+};
+
+[... inside a function ...]
+int ret;
+
+ret = platform_device_register(&ramoops_dev);
+if (ret) {
+ printk(KERN_ERR "unable to register platform device\n");
+ return ret;
+}
+
+3. Dump format
+
+The data dump begins with a header, currently defined as "====" followed by a
+timestamp and a new line. The dump then continues with the actual data.
+
+4. Reading the data
+
+The dump data can be read from memory (through /dev/mem or other means).
+Getting the module parameters, which are needed in order to parse the data, can
+be done through /sys/module/ramoops/parameters/* .
diff --git a/Documentation/virtual/00-INDEX b/Documentation/virtual/00-INDEX
index fe0251c4cfb7..8e601991d91c 100644
--- a/Documentation/virtual/00-INDEX
+++ b/Documentation/virtual/00-INDEX
@@ -8,3 +8,6 @@ lguest/
- Extremely simple hypervisor for experimental/educational use.
uml/
- User Mode Linux, builds/runs Linux kernel as a userspace program.
+virtio.txt
+ - Text version of draft virtio spec.
+ See http://ozlabs.org/~rusty/virtio-spec
diff --git a/Documentation/virtual/lguest/lguest.c b/Documentation/virtual/lguest/lguest.c
index 043bd7df3139..d928c134dee6 100644
--- a/Documentation/virtual/lguest/lguest.c
+++ b/Documentation/virtual/lguest/lguest.c
@@ -1996,6 +1996,9 @@ int main(int argc, char *argv[])
/* We use a simple helper to copy the arguments separated by spaces. */
concat((char *)(boot + 1), argv+optind+2);
+ /* Set kernel alignment to 16M (CONFIG_PHYSICAL_ALIGN) */
+ boot->hdr.kernel_alignment = 0x1000000;
+
/* Boot protocol version: 2.07 supports the fields for lguest. */
boot->hdr.version = 0x207;
diff --git a/Documentation/virtual/virtio-spec.txt b/Documentation/virtual/virtio-spec.txt
new file mode 100644
index 000000000000..a350ae135b8c
--- /dev/null
+++ b/Documentation/virtual/virtio-spec.txt
@@ -0,0 +1,2200 @@
+[Generated file: see http://ozlabs.org/~rusty/virtio-spec/]
+Virtio PCI Card Specification
+v0.9.1 DRAFT
+-
+
+Rusty Russell <rusty@rustcorp.com.au>IBM Corporation (Editor)
+
+2011 August 1.
+
+Purpose and Description
+
+This document describes the specifications of the “virtio” family
+of PCI[LaTeX Command: nomenclature] devices. These are devices
+are found in virtual environments[LaTeX Command: nomenclature],
+yet by design they are not all that different from physical PCI
+devices, and this document treats them as such. This allows the
+guest to use standard PCI drivers and discovery mechanisms.
+
+The purpose of virtio and this specification is that virtual
+environments and guests should have a straightforward, efficient,
+standard and extensible mechanism for virtual devices, rather
+than boutique per-environment or per-OS mechanisms.
+
+ Straightforward: Virtio PCI devices use normal PCI mechanisms
+ of interrupts and DMA which should be familiar to any device
+ driver author. There is no exotic page-flipping or COW
+ mechanism: it's just a PCI device.[footnote:
+This lack of page-sharing implies that the implementation of the
+device (e.g. the hypervisor or host) needs full access to the
+guest memory. Communication with untrusted parties (i.e.
+inter-guest communication) requires copying.
+]
+
+ Efficient: Virtio PCI devices consist of rings of descriptors
+ for input and output, which are neatly separated to avoid cache
+ effects from both guest and device writing to the same cache
+ lines.
+
+ Standard: Virtio PCI makes no assumptions about the environment
+ in which it operates, beyond supporting PCI. In fact the virtio
+ devices specified in the appendices do not require PCI at all:
+ they have been implemented on non-PCI buses.[footnote:
+The Linux implementation further separates the PCI virtio code
+from the specific virtio drivers: these drivers are shared with
+the non-PCI implementations (currently lguest and S/390).
+]
+
+ Extensible: Virtio PCI devices contain feature bits which are
+ acknowledged by the guest operating system during device setup.
+ This allows forwards and backwards compatibility: the device
+ offers all the features it knows about, and the driver
+ acknowledges those it understands and wishes to use.
+
+ Virtqueues
+
+The mechanism for bulk data transport on virtio PCI devices is
+pretentiously called a virtqueue. Each device can have zero or
+more virtqueues: for example, the network device has one for
+transmit and one for receive.
+
+Each virtqueue occupies two or more physically-contiguous pages
+(defined, for the purposes of this specification, as 4096 bytes),
+and consists of three parts:
+
+
++-------------------+-----------------------------------+-----------+
+| Descriptor Table | Available Ring (padding) | Used Ring |
++-------------------+-----------------------------------+-----------+
+
+
+When the driver wants to send buffers to the device, it puts them
+in one or more slots in the descriptor table, and writes the
+descriptor indices into the available ring. It then notifies the
+device. When the device has finished with the buffers, it writes
+the descriptors into the used ring, and sends an interrupt.
+
+Specification
+
+ PCI Discovery
+
+Any PCI device with Vendor ID 0x1AF4, and Device ID 0x1000
+through 0x103F inclusive is a virtio device[footnote:
+The actual value within this range is ignored
+]. The device must also have a Revision ID of 0 to match this
+specification.
+
+The Subsystem Device ID indicates which virtio device is
+supported by the device. The Subsystem Vendor ID should reflect
+the PCI Vendor ID of the environment (it's currently only used
+for informational purposes by the guest).
+
+
++----------------------+--------------------+---------------+
+| Subsystem Device ID | Virtio Device | Specification |
++----------------------+--------------------+---------------+
++----------------------+--------------------+---------------+
+| 1 | network card | Appendix C |
++----------------------+--------------------+---------------+
+| 2 | block device | Appendix D |
++----------------------+--------------------+---------------+
+| 3 | console | Appendix E |
++----------------------+--------------------+---------------+
+| 4 | entropy source | Appendix F |
++----------------------+--------------------+---------------+
+| 5 | memory ballooning | Appendix G |
++----------------------+--------------------+---------------+
+| 6 | ioMemory | - |
++----------------------+--------------------+---------------+
+| 9 | 9P transport | - |
++----------------------+--------------------+---------------+
+
+
+ Device Configuration
+
+To configure the device, we use the first I/O region of the PCI
+device. This contains a virtio header followed by a
+device-specific region.
+
+There may be different widths of accesses to the I/O region; the “
+natural” access method for each field in the virtio header must
+be used (i.e. 32-bit accesses for 32-bit fields, etc), but the
+device-specific region can be accessed using any width accesses,
+and should obtain the same results.
+
+Note that this is possible because while the virtio header is PCI
+(i.e. little) endian, the device-specific region is encoded in
+the native endian of the guest (where such distinction is
+applicable).
+
+ Device Initialization Sequence
+
+We start with an overview of device initialization, then expand
+on the details of the device and how each step is preformed.
+
+ Reset the device. This is not required on initial start up.
+
+ The ACKNOWLEDGE status bit is set: we have noticed the device.
+
+ The DRIVER status bit is set: we know how to drive the device.
+
+ Device-specific setup, including reading the Device Feature
+ Bits, discovery of virtqueues for the device, optional MSI-X
+ setup, and reading and possibly writing the virtio
+ configuration space.
+
+ The subset of Device Feature Bits understood by the driver is
+ written to the device.
+
+ The DRIVER_OK status bit is set.
+
+ The device can now be used (ie. buffers added to the
+ virtqueues)[footnote:
+Historically, drivers have used the device before steps 5 and 6.
+This is only allowed if the driver does not use any features
+which would alter this early use of the device.
+]
+
+If any of these steps go irrecoverably wrong, the guest should
+set the FAILED status bit to indicate that it has given up on the
+device (it can reset the device later to restart if desired).
+
+We now cover the fields required for general setup in detail.
+
+ Virtio Header
+
+The virtio header looks as follows:
+
+
++------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
+| Bits || 32 | 32 | 32 | 16 | 16 | 16 | 8 | 8 |
++------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
+| Read/Write || R | R+W | R+W | R | R+W | R+W | R+W | R |
++------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
+| Purpose || Device | Guest | Queue | Queue | Queue | Queue | Device | ISR |
+| || Features bits 0:31 | Features bits 0:31 | Address | Size | Select | Notify | Status | Status |
++------------++---------------------+---------------------+----------+--------+---------+---------+---------+--------+
+
+
+If MSI-X is enabled for the device, two additional fields
+immediately follow this header:
+
+
++------------++----------------+--------+
+| Bits || 16 | 16 |
+ +----------------+--------+
++------------++----------------+--------+
+| Read/Write || R+W | R+W |
++------------++----------------+--------+
+| Purpose || Configuration | Queue |
+| (MSI-X) || Vector | Vector |
++------------++----------------+--------+
+
+
+Finally, if feature bits (VIRTIO_F_FEATURES_HI) this is
+immediately followed by two additional fields:
+
+
++------------++----------------------+----------------------
+| Bits || 32 | 32
++------------++----------------------+----------------------
+| Read/Write || R | R+W
++------------++----------------------+----------------------
+| Purpose || Device | Guest
+| || Features bits 32:63 | Features bits 32:63
++------------++----------------------+----------------------
+
+
+Immediately following these general headers, there may be
+device-specific headers:
+
+
++------------++--------------------+
+| Bits || Device Specific |
+ +--------------------+
++------------++--------------------+
+| Read/Write || Device Specific |
++------------++--------------------+
+| Purpose || Device Specific... |
+| || |
++------------++--------------------+
+
+
+ Device Status
+
+The Device Status field is updated by the guest to indicate its
+progress. This provides a simple low-level diagnostic: it's most
+useful to imagine them hooked up to traffic lights on the console
+indicating the status of each device.
+
+The device can be reset by writing a 0 to this field, otherwise
+at least one bit should be set:
+
+ ACKNOWLEDGE (1) Indicates that the guest OS has found the
+ device and recognized it as a valid virtio device.
+
+ DRIVER (2) Indicates that the guest OS knows how to drive the
+ device. Under Linux, drivers can be loadable modules so there
+ may be a significant (or infinite) delay before setting this
+ bit.
+
+ DRIVER_OK (3) Indicates that the driver is set up and ready to
+ drive the device.
+
+ FAILED (8) Indicates that something went wrong in the guest,
+ and it has given up on the device. This could be an internal
+ error, or the driver didn't like the device for some reason, or
+ even a fatal error during device operation. The device must be
+ reset before attempting to re-initialize.
+
+ Feature Bits
+
+The least significant 31 bits of the first configuration field
+indicates the features that the device supports (the high bit is
+reserved, and will be used to indicate the presence of future
+feature bits elsewhere). If more than 31 feature bits are
+supported, the device indicates so by setting feature bit 31 (see
+[cha:Reserved-Feature-Bits]). The bits are allocated as follows:
+
+ 0 to 23 Feature bits for the specific device type
+
+ 24 to 40 Feature bits reserved for extensions to the queue and
+ feature negotiation mechanisms
+
+ 41 to 63 Feature bits reserved for future extensions
+
+For example, feature bit 0 for a network device (i.e. Subsystem
+Device ID 1) indicates that the device supports checksumming of
+packets.
+
+The feature bits are negotiated: the device lists all the
+features it understands in the Device Features field, and the
+guest writes the subset that it understands into the Guest
+Features field. The only way to renegotiate is to reset the
+device.
+
+In particular, new fields in the device configuration header are
+indicated by offering a feature bit, so the guest can check
+before accessing that part of the configuration space.
+
+This allows for forwards and backwards compatibility: if the
+device is enhanced with a new feature bit, older guests will not
+write that feature bit back to the Guest Features field and it
+can go into backwards compatibility mode. Similarly, if a guest
+is enhanced with a feature that the device doesn't support, it
+will not see that feature bit in the Device Features field and
+can go into backwards compatibility mode (or, for poor
+implementations, set the FAILED Device Status bit).
+
+Access to feature bits 32 to 63 is enabled by Guest by setting
+feature bit 31. If this bit is unset, Device must assume that all
+feature bits > 31 are unset.
+
+ Configuration/Queue Vectors
+
+When MSI-X capability is present and enabled in the device
+(through standard PCI configuration space) 4 bytes at byte offset
+20 are used to map configuration change and queue interrupts to
+MSI-X vectors. In this case, the ISR Status field is unused, and
+device specific configuration starts at byte offset 24 in virtio
+header structure. When MSI-X capability is not enabled, device
+specific configuration starts at byte offset 20 in virtio header.
+
+Writing a valid MSI-X Table entry number, 0 to 0x7FF, to one of
+Configuration/Queue Vector registers, maps interrupts triggered
+by the configuration change/selected queue events respectively to
+the corresponding MSI-X vector. To disable interrupts for a
+specific event type, unmap it by writing a special NO_VECTOR
+value:
+
+/* Vector value used to disable MSI for queue */
+
+#define VIRTIO_MSI_NO_VECTOR 0xffff
+
+Reading these registers returns vector mapped to a given event,
+or NO_VECTOR if unmapped. All queue and configuration change
+events are unmapped by default.
+
+Note that mapping an event to vector might require allocating
+internal device resources, and might fail. Devices report such
+failures by returning the NO_VECTOR value when the relevant
+Vector field is read. After mapping an event to vector, the
+driver must verify success by reading the Vector field value: on
+success, the previously written value is returned, and on
+failure, NO_VECTOR is returned. If a mapping failure is detected,
+the driver can retry mapping with fewervectors, or disable MSI-X.
+
+ Virtqueue Configuration
+
+As a device can have zero or more virtqueues for bulk data
+transport (for example, the network driver has two), the driver
+needs to configure them as part of the device-specific
+configuration.
+
+This is done as follows, for each virtqueue a device has:
+
+ Write the virtqueue index (first queue is 0) to the Queue
+ Select field.
+
+ Read the virtqueue size from the Queue Size field, which is
+ always a power of 2. This controls how big the virtqueue is
+ (see below). If this field is 0, the virtqueue does not exist.
+
+ Allocate and zero virtqueue in contiguous physical memory, on a
+ 4096 byte alignment. Write the physical address, divided by
+ 4096 to the Queue Address field.[footnote:
+The 4096 is based on the x86 page size, but it's also large
+enough to ensure that the separate parts of the virtqueue are on
+separate cache lines.
+]
+
+ Optionally, if MSI-X capability is present and enabled on the
+ device, select a vector to use to request interrupts triggered
+ by virtqueue events. Write the MSI-X Table entry number
+ corresponding to this vector in Queue Vector field. Read the
+ Queue Vector field: on success, previously written value is
+ returned; on failure, NO_VECTOR value is returned.
+
+The Queue Size field controls the total number of bytes required
+for the virtqueue according to the following formula:
+
+#define ALIGN(x) (((x) + 4095) & ~4095)
+
+static inline unsigned vring_size(unsigned int qsz)
+
+{
+
+ return ALIGN(sizeof(struct vring_desc)*qsz + sizeof(u16)*(2
++ qsz))
+
+ + ALIGN(sizeof(struct vring_used_elem)*qsz);
+
+}
+
+This currently wastes some space with padding, but also allows
+future extensions. The virtqueue layout structure looks like this
+(qsz is the Queue Size field, which is a variable, so this code
+won't compile):
+
+struct vring {
+
+ /* The actual descriptors (16 bytes each) */
+
+ struct vring_desc desc[qsz];
+
+
+
+ /* A ring of available descriptor heads with free-running
+index. */
+
+ struct vring_avail avail;
+
+
+
+ // Padding to the next 4096 boundary.
+
+ char pad[];
+
+
+
+ // A ring of used descriptor heads with free-running index.
+
+ struct vring_used used;
+
+};
+
+ A Note on Virtqueue Endianness
+
+Note that the endian of these fields and everything else in the
+virtqueue is the native endian of the guest, not little-endian as
+PCI normally is. This makes for simpler guest code, and it is
+assumed that the host already has to be deeply aware of the guest
+endian so such an “endian-aware” device is not a significant
+issue.
+
+ Descriptor Table
+
+The descriptor table refers to the buffers the guest is using for
+the device. The addresses are physical addresses, and the buffers
+can be chained via the next field. Each descriptor describes a
+buffer which is read-only or write-only, but a chain of
+descriptors can contain both read-only and write-only buffers.
+
+No descriptor chain may be more than 2^32 bytes long in total.struct vring_desc {
+
+ /* Address (guest-physical). */
+
+ u64 addr;
+
+ /* Length. */
+
+ u32 len;
+
+/* This marks a buffer as continuing via the next field. */
+
+#define VRING_DESC_F_NEXT 1
+
+/* This marks a buffer as write-only (otherwise read-only). */
+
+#define VRING_DESC_F_WRITE 2
+
+/* This means the buffer contains a list of buffer descriptors.
+*/
+
+#define VRING_DESC_F_INDIRECT 4
+
+ /* The flags as indicated above. */
+
+ u16 flags;
+
+ /* Next field if flags & NEXT */
+
+ u16 next;
+
+};
+
+The number of descriptors in the table is specified by the Queue
+Size field for this virtqueue.
+
+ <sub:Indirect-Descriptors>Indirect Descriptors
+
+Some devices benefit by concurrently dispatching a large number
+of large requests. The VIRTIO_RING_F_INDIRECT_DESC feature can be
+used to allow this (see [cha:Reserved-Feature-Bits]). To increase
+ring capacity it is possible to store a table of indirect
+descriptors anywhere in memory, and insert a descriptor in main
+virtqueue (with flags&INDIRECT on) that refers to memory buffer
+containing this indirect descriptor table; fields addr and len
+refer to the indirect table address and length in bytes,
+respectively. The indirect table layout structure looks like this
+(len is the length of the descriptor that refers to this table,
+which is a variable, so this code won't compile):
+
+struct indirect_descriptor_table {
+
+ /* The actual descriptors (16 bytes each) */
+
+ struct vring_desc desc[len / 16];
+
+};
+
+The first indirect descriptor is located at start of the indirect
+descriptor table (index 0), additional indirect descriptors are
+chained by next field. An indirect descriptor without next field
+(with flags&NEXT off) signals the end of the indirect descriptor
+table, and transfers control back to the main virtqueue. An
+indirect descriptor can not refer to another indirect descriptor
+table (flags&INDIRECT must be off). A single indirect descriptor
+table can include both read-only and write-only descriptors;
+write-only flag (flags&WRITE) in the descriptor that refers to it
+is ignored.
+
+ Available Ring
+
+The available ring refers to what descriptors we are offering the
+device: it refers to the head of a descriptor chain. The “flags”
+field is currently 0 or 1: 1 indicating that we do not need an
+interrupt when the device consumes a descriptor from the
+available ring. Alternatively, the guest can ask the device to
+delay interrupts until an entry with an index specified by the “
+used_event” field is written in the used ring (equivalently,
+until the idx field in the used ring will reach the value
+used_event + 1). The method employed by the device is controlled
+by the VIRTIO_RING_F_EVENT_IDX feature bit (see [cha:Reserved-Feature-Bits]
+). This interrupt suppression is merely an optimization; it may
+not suppress interrupts entirely.
+
+The “idx” field indicates where we would put the next descriptor
+entry (modulo the ring size). This starts at 0, and increases.
+
+struct vring_avail {
+
+#define VRING_AVAIL_F_NO_INTERRUPT 1
+
+ u16 flags;
+
+ u16 idx;
+
+ u16 ring[qsz]; /* qsz is the Queue Size field read from device
+*/
+
+ u16 used_event;
+
+};
+
+ Used Ring
+
+The used ring is where the device returns buffers once it is done
+with them. The flags field can be used by the device to hint that
+no notification is necessary when the guest adds to the available
+ring. Alternatively, the “avail_event” field can be used by the
+device to hint that no notification is necessary until an entry
+with an index specified by the “avail_event” is written in the
+available ring (equivalently, until the idx field in the
+available ring will reach the value avail_event + 1). The method
+employed by the device is controlled by the guest through the
+VIRTIO_RING_F_EVENT_IDX feature bit (see [cha:Reserved-Feature-Bits]
+). [footnote:
+These fields are kept here because this is the only part of the
+virtqueue written by the device
+].
+
+Each entry in the ring is a pair: the head entry of the
+descriptor chain describing the buffer (this matches an entry
+placed in the available ring by the guest earlier), and the total
+of bytes written into the buffer. The latter is extremely useful
+for guests using untrusted buffers: if you do not know exactly
+how much has been written by the device, you usually have to zero
+the buffer to ensure no data leakage occurs.
+
+/* u32 is used here for ids for padding reasons. */
+
+struct vring_used_elem {
+
+ /* Index of start of used descriptor chain. */
+
+ u32 id;
+
+ /* Total length of the descriptor chain which was used
+(written to) */
+
+ u32 len;
+
+};
+
+
+
+struct vring_used {
+
+#define VRING_USED_F_NO_NOTIFY 1
+
+ u16 flags;
+
+ u16 idx;
+
+ struct vring_used_elem ring[qsz];
+
+ u16 avail_event;
+
+};
+
+ Helpers for Managing Virtqueues
+
+The Linux Kernel Source code contains the definitions above and
+helper routines in a more usable form, in
+include/linux/virtio_ring.h. This was explicitly licensed by IBM
+and Red Hat under the (3-clause) BSD license so that it can be
+freely used by all other projects, and is reproduced (with slight
+variation to remove Linux assumptions) in Appendix A.
+
+ Device Operation
+
+There are two parts to device operation: supplying new buffers to
+the device, and processing used buffers from the device. As an
+example, the virtio network device has two virtqueues: the
+transmit virtqueue and the receive virtqueue. The driver adds
+outgoing (read-only) packets to the transmit virtqueue, and then
+frees them after they are used. Similarly, incoming (write-only)
+buffers are added to the receive virtqueue, and processed after
+they are used.
+
+ Supplying Buffers to The Device
+
+Actual transfer of buffers from the guest OS to the device
+operates as follows:
+
+ Place the buffer(s) into free descriptor(s).
+
+ If there are no free descriptors, the guest may choose to
+ notify the device even if notifications are suppressed (to
+ reduce latency).[footnote:
+The Linux drivers do this only for read-only buffers: for
+write-only buffers, it is assumed that the driver is merely
+trying to keep the receive buffer ring full, and no notification
+of this expected condition is necessary.
+]
+
+ Place the id of the buffer in the next ring entry of the
+ available ring.
+
+ The steps (1) and (2) may be performed repeatedly if batching
+ is possible.
+
+ A memory barrier should be executed to ensure the device sees
+ the updated descriptor table and available ring before the next
+ step.
+
+ The available “idx” field should be increased by the number of
+ entries added to the available ring.
+
+ A memory barrier should be executed to ensure that we update
+ the idx field before checking for notification suppression.
+
+ If notifications are not suppressed, the device should be
+ notified of the new buffers.
+
+Note that the above code does not take precautions against the
+available ring buffer wrapping around: this is not possible since
+the ring buffer is the same size as the descriptor table, so step
+(1) will prevent such a condition.
+
+In addition, the maximum queue size is 32768 (it must be a power
+of 2 which fits in 16 bits), so the 16-bit “idx” value can always
+distinguish between a full and empty buffer.
+
+Here is a description of each stage in more detail.
+
+ Placing Buffers Into The Descriptor Table
+
+A buffer consists of zero or more read-only physically-contiguous
+elements followed by zero or more physically-contiguous
+write-only elements (it must have at least one element). This
+algorithm maps it into the descriptor table:
+
+ for each buffer element, b:
+
+ Get the next free descriptor table entry, d
+
+ Set d.addr to the physical address of the start of b
+
+ Set d.len to the length of b.
+
+ If b is write-only, set d.flags to VRING_DESC_F_WRITE,
+ otherwise 0.
+
+ If there is a buffer element after this:
+
+ Set d.next to the index of the next free descriptor element.
+
+ Set the VRING_DESC_F_NEXT bit in d.flags.
+
+In practice, the d.next fields are usually used to chain free
+descriptors, and a separate count kept to check there are enough
+free descriptors before beginning the mappings.
+
+ Updating The Available Ring
+
+The head of the buffer we mapped is the first d in the algorithm
+above. A naive implementation would do the following:
+
+avail->ring[avail->idx % qsz] = head;
+
+However, in general we can add many descriptors before we update
+the “idx” field (at which point they become visible to the
+device), so we keep a counter of how many we've added:
+
+avail->ring[(avail->idx + added++) % qsz] = head;
+
+ Updating The Index Field
+
+Once the idx field of the virtqueue is updated, the device will
+be able to access the descriptor entries we've created and the
+memory they refer to. This is why a memory barrier is generally
+used before the idx update, to ensure it sees the most up-to-date
+copy.
+
+The idx field always increments, and we let it wrap naturally at
+65536:
+
+avail->idx += added;
+
+ <sub:Notifying-The-Device>Notifying The Device
+
+Device notification occurs by writing the 16-bit virtqueue index
+of this virtqueue to the Queue Notify field of the virtio header
+in the first I/O region of the PCI device. This can be expensive,
+however, so the device can suppress such notifications if it
+doesn't need them. We have to be careful to expose the new idx
+value before checking the suppression flag: it's OK to notify
+gratuitously, but not to omit a required notification. So again,
+we use a memory barrier here before reading the flags or the
+avail_event field.
+
+If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated, and if
+the VRING_USED_F_NOTIFY flag is not set, we go ahead and write to
+the PCI configuration space.
+
+If the VIRTIO_F_RING_EVENT_IDX feature is negotiated, we read the
+avail_event field in the available ring structure. If the
+available index crossed_the avail_event field value since the
+last notification, we go ahead and write to the PCI configuration
+space. The avail_event field wraps naturally at 65536 as well:
+
+(u16)(new_idx - avail_event - 1) < (u16)(new_idx - old_idx)
+
+ <sub:Receiving-Used-Buffers>Receiving Used Buffers From The
+ Device
+
+Once the device has used a buffer (read from or written to it, or
+parts of both, depending on the nature of the virtqueue and the
+device), it sends an interrupt, following an algorithm very
+similar to the algorithm used for the driver to send the device a
+buffer:
+
+ Write the head descriptor number to the next field in the used
+ ring.
+
+ Update the used ring idx.
+
+ Determine whether an interrupt is necessary:
+
+ If the VIRTIO_F_RING_EVENT_IDX feature is not negotiated: check
+ if f the VRING_AVAIL_F_NO_INTERRUPT flag is not set in avail-
+ >flags
+
+ If the VIRTIO_F_RING_EVENT_IDX feature is negotiated: check
+ whether the used index crossed the used_event field value
+ since the last update. The used_event field wraps naturally
+ at 65536 as well:(u16)(new_idx - used_event - 1) < (u16)(new_idx - old_idx)
+
+ If an interrupt is necessary:
+
+ If MSI-X capability is disabled:
+
+ Set the lower bit of the ISR Status field for the device.
+
+ Send the appropriate PCI interrupt for the device.
+
+ If MSI-X capability is enabled:
+
+ Request the appropriate MSI-X interrupt message for the
+ device, Queue Vector field sets the MSI-X Table entry
+ number.
+
+ If Queue Vector field value is NO_VECTOR, no interrupt
+ message is requested for this event.
+
+The guest interrupt handler should:
+
+ If MSI-X capability is disabled: read the ISR Status field,
+ which will reset it to zero. If the lower bit is zero, the
+ interrupt was not for this device. Otherwise, the guest driver
+ should look through the used rings of each virtqueue for the
+ device, to see if any progress has been made by the device
+ which requires servicing.
+
+ If MSI-X capability is enabled: look through the used rings of
+ each virtqueue mapped to the specific MSI-X vector for the
+ device, to see if any progress has been made by the device
+ which requires servicing.
+
+For each ring, guest should then disable interrupts by writing
+VRING_AVAIL_F_NO_INTERRUPT flag in avail structure, if required.
+It can then process used ring entries finally enabling interrupts
+by clearing the VRING_AVAIL_F_NO_INTERRUPT flag or updating the
+EVENT_IDX field in the available structure, Guest should then
+execute a memory barrier, and then recheck the ring empty
+condition. This is necessary to handle the case where, after the
+last check and before enabling interrupts, an interrupt has been
+suppressed by the device:
+
+vring_disable_interrupts(vq);
+
+for (;;) {
+
+ if (vq->last_seen_used != vring->used.idx) {
+
+ vring_enable_interrupts(vq);
+
+ mb();
+
+ if (vq->last_seen_used != vring->used.idx)
+
+ break;
+
+ }
+
+ struct vring_used_elem *e =
+vring.used->ring[vq->last_seen_used%vsz];
+
+ process_buffer(e);
+
+ vq->last_seen_used++;
+
+}
+
+ Dealing With Configuration Changes
+
+Some virtio PCI devices can change the device configuration
+state, as reflected in the virtio header in the PCI configuration
+space. In this case:
+
+ If MSI-X capability is disabled: an interrupt is delivered and
+ the second highest bit is set in the ISR Status field to
+ indicate that the driver should re-examine the configuration
+ space.Note that a single interrupt can indicate both that one
+ or more virtqueue has been used and that the configuration
+ space has changed: even if the config bit is set, virtqueues
+ must be scanned.
+
+ If MSI-X capability is enabled: an interrupt message is
+ requested. The Configuration Vector field sets the MSI-X Table
+ entry number to use. If Configuration Vector field value is
+ NO_VECTOR, no interrupt message is requested for this event.
+
+Creating New Device Types
+
+Various considerations are necessary when creating a new device
+type:
+
+ How Many Virtqueues?
+
+It is possible that a very simple device will operate entirely
+through its configuration space, but most will need at least one
+virtqueue in which it will place requests. A device with both
+input and output (eg. console and network devices described here)
+need two queues: one which the driver fills with buffers to
+receive input, and one which the driver places buffers to
+transmit output.
+
+ What Configuration Space Layout?
+
+Configuration space is generally used for rarely-changing or
+initialization-time parameters. But it is a limited resource, so
+it might be better to use a virtqueue to update configuration
+information (the network device does this for filtering,
+otherwise the table in the config space could potentially be very
+large).
+
+Note that this space is generally the guest's native endian,
+rather than PCI's little-endian.
+
+ What Device Number?
+
+Currently device numbers are assigned quite freely: a simple
+request mail to the author of this document or the Linux
+virtualization mailing list[footnote:
+
+https://lists.linux-foundation.org/mailman/listinfo/virtualization
+] will be sufficient to secure a unique one.
+
+Meanwhile for experimental drivers, use 65535 and work backwards.
+
+ How many MSI-X vectors?
+
+Using the optional MSI-X capability devices can speed up
+interrupt processing by removing the need to read ISR Status
+register by guest driver (which might be an expensive operation),
+reducing interrupt sharing between devices and queues within the
+device, and handling interrupts from multiple CPUs. However, some
+systems impose a limit (which might be as low as 256) on the
+total number of MSI-X vectors that can be allocated to all
+devices. Devices and/or device drivers should take this into
+account, limiting the number of vectors used unless the device is
+expected to cause a high volume of interrupts. Devices can
+control the number of vectors used by limiting the MSI-X Table
+Size or not presenting MSI-X capability in PCI configuration
+space. Drivers can control this by mapping events to as small
+number of vectors as possible, or disabling MSI-X capability
+altogether.
+
+ Message Framing
+
+The descriptors used for a buffer should not effect the semantics
+of the message, except for the total length of the buffer. For
+example, a network buffer consists of a 10 byte header followed
+by the network packet. Whether this is presented in the ring
+descriptor chain as (say) a 10 byte buffer and a 1514 byte
+buffer, or a single 1524 byte buffer, or even three buffers,
+should have no effect.
+
+In particular, no implementation should use the descriptor
+boundaries to determine the size of any header in a request.[footnote:
+The current qemu device implementations mistakenly insist that
+the first descriptor cover the header in these cases exactly, so
+a cautious driver should arrange it so.
+]
+
+ Device Improvements
+
+Any change to configuration space, or new virtqueues, or
+behavioural changes, should be indicated by negotiation of a new
+feature bit. This establishes clarity[footnote:
+Even if it does mean documenting design or implementation
+mistakes!
+] and avoids future expansion problems.
+
+Clusters of functionality which are always implemented together
+can use a single bit, but if one feature makes sense without the
+others they should not be gratuitously grouped together to
+conserve feature bits. We can always extend the spec when the
+first person needs more than 24 feature bits for their device.
+
+[LaTeX Command: printnomenclature]
+
+Appendix A: virtio_ring.h
+
+#ifndef VIRTIO_RING_H
+
+#define VIRTIO_RING_H
+
+/* An interface for efficient virtio implementation.
+
+ *
+
+ * This header is BSD licensed so anyone can use the definitions
+
+ * to implement compatible drivers/servers.
+
+ *
+
+ * Copyright 2007, 2009, IBM Corporation
+
+ * Copyright 2011, Red Hat, Inc
+
+ * All rights reserved.
+
+ *
+
+ * Redistribution and use in source and binary forms, with or
+without
+
+ * modification, are permitted provided that the following
+conditions
+
+ * are met:
+
+ * 1. Redistributions of source code must retain the above
+copyright
+
+ * notice, this list of conditions and the following
+disclaimer.
+
+ * 2. Redistributions in binary form must reproduce the above
+copyright
+
+ * notice, this list of conditions and the following
+disclaimer in the
+
+ * documentation and/or other materials provided with the
+distribution.
+
+ * 3. Neither the name of IBM nor the names of its contributors
+
+ * may be used to endorse or promote products derived from
+this software
+
+ * without specific prior written permission.
+
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND
+CONTRIBUTORS ``AS IS'' AND
+
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
+TO, THE
+
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A
+PARTICULAR PURPOSE
+
+ * ARE DISCLAIMED. IN NO EVENT SHALL IBM OR CONTRIBUTORS BE
+LIABLE
+
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+CONSEQUENTIAL
+
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+SUBSTITUTE GOODS
+
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
+INTERRUPTION)
+
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
+CONTRACT, STRICT
+
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING
+IN ANY WAY
+
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
+POSSIBILITY OF
+
+ * SUCH DAMAGE.
+
+ */
+
+
+
+/* This marks a buffer as continuing via the next field. */
+
+#define VRING_DESC_F_NEXT 1
+
+/* This marks a buffer as write-only (otherwise read-only). */
+
+#define VRING_DESC_F_WRITE 2
+
+
+
+/* The Host uses this in used->flags to advise the Guest: don't
+kick me
+
+ * when you add a buffer. It's unreliable, so it's simply an
+
+ * optimization. Guest will still kick if it's out of buffers.
+*/
+
+#define VRING_USED_F_NO_NOTIFY 1
+
+/* The Guest uses this in avail->flags to advise the Host: don't
+
+ * interrupt me when you consume a buffer. It's unreliable, so
+it's
+
+ * simply an optimization. */
+
+#define VRING_AVAIL_F_NO_INTERRUPT 1
+
+
+
+/* Virtio ring descriptors: 16 bytes.
+
+ * These can chain together via "next". */
+
+struct vring_desc {
+
+ /* Address (guest-physical). */
+
+ uint64_t addr;
+
+ /* Length. */
+
+ uint32_t len;
+
+ /* The flags as indicated above. */
+
+ uint16_t flags;
+
+ /* We chain unused descriptors via this, too */
+
+ uint16_t next;
+
+};
+
+
+
+struct vring_avail {
+
+ uint16_t flags;
+
+ uint16_t idx;
+
+ uint16_t ring[];
+
+ uint16_t used_event;
+
+};
+
+
+
+/* u32 is used here for ids for padding reasons. */
+
+struct vring_used_elem {
+
+ /* Index of start of used descriptor chain. */
+
+ uint32_t id;
+
+ /* Total length of the descriptor chain which was written
+to. */
+
+ uint32_t len;
+
+};
+
+
+
+struct vring_used {
+
+ uint16_t flags;
+
+ uint16_t idx;
+
+ struct vring_used_elem ring[];
+
+ uint16_t avail_event;
+
+};
+
+
+
+struct vring {
+
+ unsigned int num;
+
+
+
+ struct vring_desc *desc;
+
+ struct vring_avail *avail;
+
+ struct vring_used *used;
+
+};
+
+
+
+/* The standard layout for the ring is a continuous chunk of
+memory which
+
+ * looks like this. We assume num is a power of 2.
+
+ *
+
+ * struct vring {
+
+ * // The actual descriptors (16 bytes each)
+
+ * struct vring_desc desc[num];
+
+ *
+
+ * // A ring of available descriptor heads with free-running
+index.
+
+ * __u16 avail_flags;
+
+ * __u16 avail_idx;
+
+ * __u16 available[num];
+
+ *
+
+ * // Padding to the next align boundary.
+
+ * char pad[];
+
+ *
+
+ * // A ring of used descriptor heads with free-running
+index.
+
+ * __u16 used_flags;
+
+ * __u16 EVENT_IDX;
+
+ * struct vring_used_elem used[num];
+
+ * };
+
+ * Note: for virtio PCI, align is 4096.
+
+ */
+
+static inline void vring_init(struct vring *vr, unsigned int num,
+void *p,
+
+ unsigned long align)
+
+{
+
+ vr->num = num;
+
+ vr->desc = p;
+
+ vr->avail = p + num*sizeof(struct vring_desc);
+
+ vr->used = (void *)(((unsigned long)&vr->avail->ring[num]
+
+ + align-1)
+
+ & ~(align - 1));
+
+}
+
+
+
+static inline unsigned vring_size(unsigned int num, unsigned long
+align)
+
+{
+
+ return ((sizeof(struct vring_desc)*num +
+sizeof(uint16_t)*(2+num)
+
+ + align - 1) & ~(align - 1))
+
+ + sizeof(uint16_t)*3 + sizeof(struct
+vring_used_elem)*num;
+
+}
+
+
+
+static inline int vring_need_event(uint16_t event_idx, uint16_t
+new_idx, uint16_t old_idx)
+
+{
+
+ return (uint16_t)(new_idx - event_idx - 1) <
+(uint16_t)(new_idx - old_idx);
+
+}
+
+#endif /* VIRTIO_RING_H */
+
+<cha:Reserved-Feature-Bits>Appendix B: Reserved Feature Bits
+
+Currently there are five device-independent feature bits defined:
+
+ VIRTIO_F_NOTIFY_ON_EMPTY (24) Negotiating this feature
+ indicates that the driver wants an interrupt if the device runs
+ out of available descriptors on a virtqueue, even though
+ interrupts are suppressed using the VRING_AVAIL_F_NO_INTERRUPT
+ flag or the used_event field. An example of this is the
+ networking driver: it doesn't need to know every time a packet
+ is transmitted, but it does need to free the transmitted
+ packets a finite time after they are transmitted. It can avoid
+ using a timer if the device interrupts it when all the packets
+ are transmitted.
+
+ VIRTIO_F_RING_INDIRECT_DESC (28) Negotiating this feature
+ indicates that the driver can use descriptors with the
+ VRING_DESC_F_INDIRECT flag set, as described in [sub:Indirect-Descriptors]
+ .
+
+ VIRTIO_F_RING_EVENT_IDX(29) This feature enables the used_event
+ and the avail_event fields. If set, it indicates that the
+ device should ignore the flags field in the available ring
+ structure. Instead, the used_event field in this structure is
+ used by guest to suppress device interrupts. Further, the
+ driver should ignore the flags field in the used ring
+ structure. Instead, the avail_event field in this structure is
+ used by the device to suppress notifications. If unset, the
+ driver should ignore the used_event field; the device should
+ ignore the avail_event field; the flags field is used
+
+ VIRTIO_F_BAD_FEATURE(30) This feature should never be
+ negotiated by the guest; doing so is an indication that the
+ guest is faulty[footnote:
+An experimental virtio PCI driver contained in Linux version
+2.6.25 had this problem, and this feature bit can be used to
+detect it.
+]
+
+ VIRTIO_F_FEATURES_HIGH(31) This feature indicates that the
+ device supports feature bits 32:63. If unset, feature bits
+ 32:63 are unset.
+
+Appendix C: Network Device
+
+The virtio network device is a virtual ethernet card, and is the
+most complex of the devices supported so far by virtio. It has
+enhanced rapidly and demonstrates clearly how support for new
+features should be added to an existing device. Empty buffers are
+placed in one virtqueue for receiving packets, and outgoing
+packets are enqueued into another for transmission in that order.
+A third command queue is used to control advanced filtering
+features.
+
+ Configuration
+
+ Subsystem Device ID 1
+
+ Virtqueues 0:receiveq. 1:transmitq. 2:controlq[footnote:
+Only if VIRTIO_NET_F_CTRL_VQ set
+]
+
+ Feature bits
+
+ VIRTIO_NET_F_CSUM (0) Device handles packets with partial
+ checksum
+
+ VIRTIO_NET_F_GUEST_CSUM (1) Guest handles packets with partial
+ checksum
+
+ VIRTIO_NET_F_MAC (5) Device has given MAC address.
+
+ VIRTIO_NET_F_GSO (6) (Deprecated) device handles packets with
+ any GSO type.[footnote:
+It was supposed to indicate segmentation offload support, but
+upon further investigation it became clear that multiple bits
+were required.
+]
+
+ VIRTIO_NET_F_GUEST_TSO4 (7) Guest can receive TSOv4.
+
+ VIRTIO_NET_F_GUEST_TSO6 (8) Guest can receive TSOv6.
+
+ VIRTIO_NET_F_GUEST_ECN (9) Guest can receive TSO with ECN.
+
+ VIRTIO_NET_F_GUEST_UFO (10) Guest can receive UFO.
+
+ VIRTIO_NET_F_HOST_TSO4 (11) Device can receive TSOv4.
+
+ VIRTIO_NET_F_HOST_TSO6 (12) Device can receive TSOv6.
+
+ VIRTIO_NET_F_HOST_ECN (13) Device can receive TSO with ECN.
+
+ VIRTIO_NET_F_HOST_UFO (14) Device can receive UFO.
+
+ VIRTIO_NET_F_MRG_RXBUF (15) Guest can merge receive buffers.
+
+ VIRTIO_NET_F_STATUS (16) Configuration status field is
+ available.
+
+ VIRTIO_NET_F_CTRL_VQ (17) Control channel is available.
+
+ VIRTIO_NET_F_CTRL_RX (18) Control channel RX mode support.
+
+ VIRTIO_NET_F_CTRL_VLAN (19) Control channel VLAN filtering.
+
+ Device configuration layout Two configuration fields are
+ currently defined. The mac address field always exists (though
+ is only valid if VIRTIO_NET_F_MAC is set), and the status field
+ only exists if VIRTIO_NET_F_STATUS is set. Only one bit is
+ currently defined for the status field: VIRTIO_NET_S_LINK_UP. #define VIRTIO_NET_S_LINK_UP 1
+
+
+
+struct virtio_net_config {
+
+ u8 mac[6];
+
+ u16 status;
+
+};
+
+ Device Initialization
+
+ The initialization routine should identify the receive and
+ transmission virtqueues.
+
+ If the VIRTIO_NET_F_MAC feature bit is set, the configuration
+ space “mac” entry indicates the “physical” address of the the
+ network card, otherwise a private MAC address should be
+ assigned. All guests are expected to negotiate this feature if
+ it is set.
+
+ If the VIRTIO_NET_F_CTRL_VQ feature bit is negotiated, identify
+ the control virtqueue.
+
+ If the VIRTIO_NET_F_STATUS feature bit is negotiated, the link
+ status can be read from the bottom bit of the “status” config
+ field. Otherwise, the link should be assumed active.
+
+ The receive virtqueue should be filled with receive buffers.
+ This is described in detail below in “Setting Up Receive
+ Buffers”.
+
+ A driver can indicate that it will generate checksumless
+ packets by negotating the VIRTIO_NET_F_CSUM feature. This “
+ checksum offload” is a common feature on modern network cards.
+
+ If that feature is negotiated, a driver can use TCP or UDP
+ segmentation offload by negotiating the VIRTIO_NET_F_HOST_TSO4
+ (IPv4 TCP), VIRTIO_NET_F_HOST_TSO6 (IPv6 TCP) and
+ VIRTIO_NET_F_HOST_UFO (UDP fragmentation) features. It should
+ not send TCP packets requiring segmentation offload which have
+ the Explicit Congestion Notification bit set, unless the
+ VIRTIO_NET_F_HOST_ECN feature is negotiated.[footnote:
+This is a common restriction in real, older network cards.
+]
+
+ The converse features are also available: a driver can save the
+ virtual device some work by negotiating these features.[footnote:
+For example, a network packet transported between two guests on
+the same system may not require checksumming at all, nor
+segmentation, if both guests are amenable.
+] The VIRTIO_NET_F_GUEST_CSUM feature indicates that partially
+ checksummed packets can be received, and if it can do that then
+ the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6,
+ VIRTIO_NET_F_GUEST_UFO and VIRTIO_NET_F_GUEST_ECN are the input
+ equivalents of the features described above. See “Receiving
+ Packets” below.
+
+ Device Operation
+
+Packets are transmitted by placing them in the transmitq, and
+buffers for incoming packets are placed in the receiveq. In each
+case, the packet itself is preceeded by a header:
+
+struct virtio_net_hdr {
+
+#define VIRTIO_NET_HDR_F_NEEDS_CSUM 1
+
+ u8 flags;
+
+#define VIRTIO_NET_HDR_GSO_NONE 0
+
+#define VIRTIO_NET_HDR_GSO_TCPV4 1
+
+#define VIRTIO_NET_HDR_GSO_UDP 3
+
+#define VIRTIO_NET_HDR_GSO_TCPV6 4
+
+#define VIRTIO_NET_HDR_GSO_ECN 0x80
+
+ u8 gso_type;
+
+ u16 hdr_len;
+
+ u16 gso_size;
+
+ u16 csum_start;
+
+ u16 csum_offset;
+
+/* Only if VIRTIO_NET_F_MRG_RXBUF: */
+
+ u16 num_buffers
+
+};
+
+The controlq is used to control device features such as
+filtering.
+
+ Packet Transmission
+
+Transmitting a single packet is simple, but varies depending on
+the different features the driver negotiated.
+
+ If the driver negotiated VIRTIO_NET_F_CSUM, and the packet has
+ not been fully checksummed, then the virtio_net_hdr's fields
+ are set as follows. Otherwise, the packet must be fully
+ checksummed, and flags is zero.
+
+ flags has the VIRTIO_NET_HDR_F_NEEDS_CSUM set,
+
+ <ite:csum_start-is-set>csum_start is set to the offset within
+ the packet to begin checksumming, and
+
+ csum_offset indicates how many bytes after the csum_start the
+ new (16 bit ones' complement) checksum should be placed.[footnote:
+For example, consider a partially checksummed TCP (IPv4) packet.
+It will have a 14 byte ethernet header and 20 byte IP header
+followed by the TCP header (with the TCP checksum field 16 bytes
+into that header). csum_start will be 14+20 = 34 (the TCP
+checksum includes the header), and csum_offset will be 16. The
+value in the TCP checksum field will be the sum of the TCP pseudo
+header, so that replacing it by the ones' complement checksum of
+the TCP header and body will give the correct result.
+]
+
+ <enu:If-the-driver>If the driver negotiated
+ VIRTIO_NET_F_HOST_TSO4, TSO6 or UFO, and the packet requires
+ TCP segmentation or UDP fragmentation, then the “gso_type”
+ field is set to VIRTIO_NET_HDR_GSO_TCPV4, TCPV6 or UDP.
+ (Otherwise, it is set to VIRTIO_NET_HDR_GSO_NONE). In this
+ case, packets larger than 1514 bytes can be transmitted: the
+ metadata indicates how to replicate the packet header to cut it
+ into smaller packets. The other gso fields are set:
+
+ hdr_len is a hint to the device as to how much of the header
+ needs to be kept to copy into each packet, usually set to the
+ length of the headers, including the transport header.[footnote:
+Due to various bugs in implementations, this field is not useful
+as a guarantee of the transport header size.
+]
+
+ gso_size is the size of the packet beyond that header (ie.
+ MSS).
+
+ If the driver negotiated the VIRTIO_NET_F_HOST_ECN feature, the
+ VIRTIO_NET_HDR_GSO_ECN bit may be set in “gso_type” as well,
+ indicating that the TCP packet has the ECN bit set.[footnote:
+This case is not handled by some older hardware, so is called out
+specifically in the protocol.
+]
+
+ If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature,
+ the num_buffers field is set to zero.
+
+ The header and packet are added as one output buffer to the
+ transmitq, and the device is notified of the new entry (see [sub:Notifying-The-Device]
+ ).[footnote:
+Note that the header will be two bytes longer for the
+VIRTIO_NET_F_MRG_RXBUF case.
+]
+
+ Packet Transmission Interrupt
+
+Often a driver will suppress transmission interrupts using the
+VRING_AVAIL_F_NO_INTERRUPT flag (see [sub:Receiving-Used-Buffers]
+) and check for used packets in the transmit path of following
+packets. However, it will still receive interrupts if the
+VIRTIO_F_NOTIFY_ON_EMPTY feature is negotiated, indicating that
+the transmission queue is completely emptied.
+
+The normal behavior in this interrupt handler is to retrieve and
+new descriptors from the used ring and free the corresponding
+headers and packets.
+
+ Setting Up Receive Buffers
+
+It is generally a good idea to keep the receive virtqueue as
+fully populated as possible: if it runs out, network performance
+will suffer.
+
+If the VIRTIO_NET_F_GUEST_TSO4, VIRTIO_NET_F_GUEST_TSO6 or
+VIRTIO_NET_F_GUEST_UFO features are used, the Guest will need to
+accept packets of up to 65550 bytes long (the maximum size of a
+TCP or UDP packet, plus the 14 byte ethernet header), otherwise
+1514 bytes. So unless VIRTIO_NET_F_MRG_RXBUF is negotiated, every
+buffer in the receive queue needs to be at least this length [footnote:
+Obviously each one can be split across multiple descriptor
+elements.
+].
+
+If VIRTIO_NET_F_MRG_RXBUF is negotiated, each buffer must be at
+least the size of the struct virtio_net_hdr.
+
+ Packet Receive Interrupt
+
+When a packet is copied into a buffer in the receiveq, the
+optimal path is to disable further interrupts for the receiveq
+(see [sub:Receiving-Used-Buffers]) and process packets until no
+more are found, then re-enable them.
+
+Processing packet involves:
+
+ If the driver negotiated the VIRTIO_NET_F_MRG_RXBUF feature,
+ then the “num_buffers” field indicates how many descriptors
+ this packet is spread over (including this one). This allows
+ receipt of large packets without having to allocate large
+ buffers. In this case, there will be at least “num_buffers” in
+ the used ring, and they should be chained together to form a
+ single packet. The other buffers will not begin with a struct
+ virtio_net_hdr.
+
+ If the VIRTIO_NET_F_MRG_RXBUF feature was not negotiated, or
+ the “num_buffers” field is one, then the entire packet will be
+ contained within this buffer, immediately following the struct
+ virtio_net_hdr.
+
+ If the VIRTIO_NET_F_GUEST_CSUM feature was negotiated, the
+ VIRTIO_NET_HDR_F_NEEDS_CSUM bit in the “flags” field may be
+ set: if so, the checksum on the packet is incomplete and the “
+ csum_start” and “csum_offset” fields indicate how to calculate
+ it (see [ite:csum_start-is-set]).
+
+ If the VIRTIO_NET_F_GUEST_TSO4, TSO6 or UFO options were
+ negotiated, then the “gso_type” may be something other than
+ VIRTIO_NET_HDR_GSO_NONE, and the “gso_size” field indicates the
+ desired MSS (see [enu:If-the-driver]).Control Virtqueue
+
+The driver uses the control virtqueue (if VIRTIO_NET_F_VTRL_VQ is
+negotiated) to send commands to manipulate various features of
+the device which would not easily map into the configuration
+space.
+
+All commands are of the following form:
+
+struct virtio_net_ctrl {
+
+ u8 class;
+
+ u8 command;
+
+ u8 command-specific-data[];
+
+ u8 ack;
+
+};
+
+
+
+/* ack values */
+
+#define VIRTIO_NET_OK 0
+
+#define VIRTIO_NET_ERR 1
+
+The class, command and command-specific-data are set by the
+driver, and the device sets the ack byte. There is little it can
+do except issue a diagnostic if the ack byte is not
+VIRTIO_NET_OK.
+
+ Packet Receive Filtering
+
+If the VIRTIO_NET_F_CTRL_RX feature is negotiated, the driver can
+send control commands for promiscuous mode, multicast receiving,
+and filtering of MAC addresses.
+
+Note that in general, these commands are best-effort: unwanted
+packets may still arrive.
+
+ Setting Promiscuous Mode
+
+#define VIRTIO_NET_CTRL_RX 0
+
+ #define VIRTIO_NET_CTRL_RX_PROMISC 0
+
+ #define VIRTIO_NET_CTRL_RX_ALLMULTI 1
+
+The class VIRTIO_NET_CTRL_RX has two commands:
+VIRTIO_NET_CTRL_RX_PROMISC turns promiscuous mode on and off, and
+VIRTIO_NET_CTRL_RX_ALLMULTI turns all-multicast receive on and
+off. The command-specific-data is one byte containing 0 (off) or
+1 (on).
+
+ Setting MAC Address Filtering
+
+struct virtio_net_ctrl_mac {
+
+ u32 entries;
+
+ u8 macs[entries][ETH_ALEN];
+
+};
+
+
+
+#define VIRTIO_NET_CTRL_MAC 1
+
+ #define VIRTIO_NET_CTRL_MAC_TABLE_SET 0
+
+The device can filter incoming packets by any number of
+destination MAC addresses.[footnote:
+Since there are no guarentees, it can use a hash filter
+orsilently switch to allmulti or promiscuous mode if it is given
+too many addresses.
+] This table is set using the class VIRTIO_NET_CTRL_MAC and the
+command VIRTIO_NET_CTRL_MAC_TABLE_SET. The command-specific-data
+is two variable length tables of 6-byte MAC addresses. The first
+table contains unicast addresses, and the second contains
+multicast addresses.
+
+ VLAN Filtering
+
+If the driver negotiates the VIRTION_NET_F_CTRL_VLAN feature, it
+can control a VLAN filter table in the device.
+
+#define VIRTIO_NET_CTRL_VLAN 2
+
+ #define VIRTIO_NET_CTRL_VLAN_ADD 0
+
+ #define VIRTIO_NET_CTRL_VLAN_DEL 1
+
+Both the VIRTIO_NET_CTRL_VLAN_ADD and VIRTIO_NET_CTRL_VLAN_DEL
+command take a 16-bit VLAN id as the command-specific-data.
+
+Appendix D: Block Device
+
+The virtio block device is a simple virtual block device (ie.
+disk). Read and write requests (and other exotic requests) are
+placed in the queue, and serviced (probably out of order) by the
+device except where noted.
+
+ Configuration
+
+ Subsystem Device ID 2
+
+ Virtqueues 0:requestq.
+
+ Feature bits
+
+ VIRTIO_BLK_F_BARRIER (0) Host supports request barriers.
+
+ VIRTIO_BLK_F_SIZE_MAX (1) Maximum size of any single segment is
+ in “size_max”.
+
+ VIRTIO_BLK_F_SEG_MAX (2) Maximum number of segments in a
+ request is in “seg_max”.
+
+ VIRTIO_BLK_F_GEOMETRY (4) Disk-style geometry specified in “
+ geometry”.
+
+ VIRTIO_BLK_F_RO (5) Device is read-only.
+
+ VIRTIO_BLK_F_BLK_SIZE (6) Block size of disk is in “blk_size”.
+
+ VIRTIO_BLK_F_SCSI (7) Device supports scsi packet commands.
+
+ VIRTIO_BLK_F_FLUSH (9) Cache flush command support.
+
+
+
+ Device configuration layout The capacity of the device
+ (expressed in 512-byte sectors) is always present. The
+ availability of the others all depend on various feature bits
+ as indicated above. struct virtio_blk_config {
+
+ u64 capacity;
+
+ u32 size_max;
+
+ u32 seg_max;
+
+ struct virtio_blk_geometry {
+
+ u16 cylinders;
+
+ u8 heads;
+
+ u8 sectors;
+
+ } geometry;
+
+ u32 blk_size;
+
+
+
+};
+
+ Device Initialization
+
+ The device size should be read from the “capacity”
+ configuration field. No requests should be submitted which goes
+ beyond this limit.
+
+ If the VIRTIO_BLK_F_BLK_SIZE feature is negotiated, the
+ blk_size field can be read to determine the optimal sector size
+ for the driver to use. This does not effect the units used in
+ the protocol (always 512 bytes), but awareness of the correct
+ value can effect performance.
+
+ If the VIRTIO_BLK_F_RO feature is set by the device, any write
+ requests will fail.
+
+
+
+ Device Operation
+
+The driver queues requests to the virtqueue, and they are used by
+the device (not necessarily in order). Each request is of form:
+
+struct virtio_blk_req {
+
+
+
+ u32 type;
+
+ u32 ioprio;
+
+ u64 sector;
+
+ char data[][512];
+
+ u8 status;
+
+};
+
+If the device has VIRTIO_BLK_F_SCSI feature, it can also support
+scsi packet command requests, each of these requests is of form:struct virtio_scsi_pc_req {
+
+ u32 type;
+
+ u32 ioprio;
+
+ u64 sector;
+
+ char cmd[];
+
+ char data[][512];
+
+#define SCSI_SENSE_BUFFERSIZE 96
+
+ u8 sense[SCSI_SENSE_BUFFERSIZE];
+
+ u32 errors;
+
+ u32 data_len;
+
+ u32 sense_len;
+
+ u32 residual;
+
+ u8 status;
+
+};
+
+The type of the request is either a read (VIRTIO_BLK_T_IN), a
+write (VIRTIO_BLK_T_OUT), a scsi packet command
+(VIRTIO_BLK_T_SCSI_CMD or VIRTIO_BLK_T_SCSI_CMD_OUT[footnote:
+the SCSI_CMD and SCSI_CMD_OUT types are equivalent, the device
+does not distinguish between them
+]) or a flush (VIRTIO_BLK_T_FLUSH or VIRTIO_BLK_T_FLUSH_OUT[footnote:
+the FLUSH and FLUSH_OUT types are equivalent, the device does not
+distinguish between them
+]). If the device has VIRTIO_BLK_F_BARRIER feature the high bit
+(VIRTIO_BLK_T_BARRIER) indicates that this request acts as a
+barrier and that all preceeding requests must be complete before
+this one, and all following requests must not be started until
+this is complete. Note that a barrier does not flush caches in
+the underlying backend device in host, and thus does not serve as
+data consistency guarantee. Driver must use FLUSH request to
+flush the host cache.
+
+#define VIRTIO_BLK_T_IN 0
+
+#define VIRTIO_BLK_T_OUT 1
+
+#define VIRTIO_BLK_T_SCSI_CMD 2
+
+#define VIRTIO_BLK_T_SCSI_CMD_OUT 3
+
+#define VIRTIO_BLK_T_FLUSH 4
+
+#define VIRTIO_BLK_T_FLUSH_OUT 5
+
+#define VIRTIO_BLK_T_BARRIER 0x80000000
+
+The ioprio field is a hint about the relative priorities of
+requests to the device: higher numbers indicate more important
+requests.
+
+The sector number indicates the offset (multiplied by 512) where
+the read or write is to occur. This field is unused and set to 0
+for scsi packet commands and for flush commands.
+
+The cmd field is only present for scsi packet command requests,
+and indicates the command to perform. This field must reside in a
+single, separate read-only buffer; command length can be derived
+from the length of this buffer.
+
+Note that these first three (four for scsi packet commands)
+fields are always read-only: the data field is either read-only
+or write-only, depending on the request. The size of the read or
+write can be derived from the total size of the request buffers.
+
+The sense field is only present for scsi packet command requests,
+and indicates the buffer for scsi sense data.
+
+The data_len field is only present for scsi packet command
+requests, this field is deprecated, and should be ignored by the
+driver. Historically, devices copied data length there.
+
+The sense_len field is only present for scsi packet command
+requests and indicates the number of bytes actually written to
+the sense buffer.
+
+The residual field is only present for scsi packet command
+requests and indicates the residual size, calculated as data
+length - number of bytes actually transferred.
+
+The final status byte is written by the device: either
+VIRTIO_BLK_S_OK for success, VIRTIO_BLK_S_IOERR for host or guest
+error or VIRTIO_BLK_S_UNSUPP for a request unsupported by host:#define VIRTIO_BLK_S_OK 0
+
+#define VIRTIO_BLK_S_IOERR 1
+
+#define VIRTIO_BLK_S_UNSUPP 2
+
+Historically, devices assumed that the fields type, ioprio and
+sector reside in a single, separate read-only buffer; the fields
+errors, data_len, sense_len and residual reside in a single,
+separate write-only buffer; the sense field in a separate
+write-only buffer of size 96 bytes, by itself; the fields errors,
+data_len, sense_len and residual in a single write-only buffer;
+and the status field is a separate read-only buffer of size 1
+byte, by itself.
+
+Appendix E: Console Device
+
+The virtio console device is a simple device for data input and
+output. A device may have one or more ports. Each port has a pair
+of input and output virtqueues. Moreover, a device has a pair of
+control IO virtqueues. The control virtqueues are used to
+communicate information between the device and the driver about
+ports being opened and closed on either side of the connection,
+indication from the host about whether a particular port is a
+console port, adding new ports, port hot-plug/unplug, etc., and
+indication from the guest about whether a port or a device was
+successfully added, port open/close, etc.. For data IO, one or
+more empty buffers are placed in the receive queue for incoming
+data and outgoing characters are placed in the transmit queue.
+
+ Configuration
+
+ Subsystem Device ID 3
+
+ Virtqueues 0:receiveq(port0). 1:transmitq(port0), 2:control
+ receiveq[footnote:
+Ports 2 onwards only if VIRTIO_CONSOLE_F_MULTIPORT is set
+], 3:control transmitq, 4:receiveq(port1), 5:transmitq(port1),
+ ...
+
+ Feature bits
+
+ VIRTIO_CONSOLE_F_SIZE (0) Configuration cols and rows fields
+ are valid.
+
+ VIRTIO_CONSOLE_F_MULTIPORT(1) Device has support for multiple
+ ports; configuration fields nr_ports and max_nr_ports are
+ valid and control virtqueues will be used.
+
+ Device configuration layout The size of the console is supplied
+ in the configuration space if the VIRTIO_CONSOLE_F_SIZE feature
+ is set. Furthermore, if the VIRTIO_CONSOLE_F_MULTIPORT feature
+ is set, the maximum number of ports supported by the device can
+ be fetched.struct virtio_console_config {
+
+ u16 cols;
+
+ u16 rows;
+
+
+
+ u32 max_nr_ports;
+
+};
+
+ Device Initialization
+
+ If the VIRTIO_CONSOLE_F_SIZE feature is negotiated, the driver
+ can read the console dimensions from the configuration fields.
+
+ If the VIRTIO_CONSOLE_F_MULTIPORT feature is negotiated, the
+ driver can spawn multiple ports, not all of which may be
+ attached to a console. Some could be generic ports. In this
+ case, the control virtqueues are enabled and according to the
+ max_nr_ports configuration-space value, the appropriate number
+ of virtqueues are created. A control message indicating the
+ driver is ready is sent to the host. The host can then send
+ control messages for adding new ports to the device. After
+ creating and initializing each port, a
+ VIRTIO_CONSOLE_PORT_READY control message is sent to the host
+ for that port so the host can let us know of any additional
+ configuration options set for that port.
+
+ The receiveq for each port is populated with one or more
+ receive buffers.
+
+ Device Operation
+
+ For output, a buffer containing the characters is placed in the
+ port's transmitq.[footnote:
+Because this is high importance and low bandwidth, the current
+Linux implementation polls for the buffer to be used, rather than
+waiting for an interrupt, simplifying the implementation
+significantly. However, for generic serial ports with the
+O_NONBLOCK flag set, the polling limitation is relaxed and the
+consumed buffers are freed upon the next write or poll call or
+when a port is closed or hot-unplugged.
+]
+
+ When a buffer is used in the receiveq (signalled by an
+ interrupt), the contents is the input to the port associated
+ with the virtqueue for which the notification was received.
+
+ If the driver negotiated the VIRTIO_CONSOLE_F_SIZE feature, a
+ configuration change interrupt may occur. The updated size can
+ be read from the configuration fields.
+
+ If the driver negotiated the VIRTIO_CONSOLE_F_MULTIPORT
+ feature, active ports are announced by the host using the
+ VIRTIO_CONSOLE_PORT_ADD control message. The same message is
+ used for port hot-plug as well.
+
+ If the host specified a port `name', a sysfs attribute is
+ created with the name filled in, so that udev rules can be
+ written that can create a symlink from the port's name to the
+ char device for port discovery by applications in the guest.
+
+ Changes to ports' state are effected by control messages.
+ Appropriate action is taken on the port indicated in the
+ control message. The layout of the structure of the control
+ buffer and the events associated are:struct virtio_console_control {
+
+ uint32_t id; /* Port number */
+
+ uint16_t event; /* The kind of control event */
+
+ uint16_t value; /* Extra information for the event */
+
+};
+
+
+
+/* Some events for the internal messages (control packets) */
+
+
+
+#define VIRTIO_CONSOLE_DEVICE_READY 0
+
+#define VIRTIO_CONSOLE_PORT_ADD 1
+
+#define VIRTIO_CONSOLE_PORT_REMOVE 2
+
+#define VIRTIO_CONSOLE_PORT_READY 3
+
+#define VIRTIO_CONSOLE_CONSOLE_PORT 4
+
+#define VIRTIO_CONSOLE_RESIZE 5
+
+#define VIRTIO_CONSOLE_PORT_OPEN 6
+
+#define VIRTIO_CONSOLE_PORT_NAME 7
+
+Appendix F: Entropy Device
+
+The virtio entropy device supplies high-quality randomness for
+guest use.
+
+ Configuration
+
+ Subsystem Device ID 4
+
+ Virtqueues 0:requestq.
+
+ Feature bits None currently defined
+
+ Device configuration layout None currently defined.
+
+ Device Initialization
+
+ The virtqueue is initialized
+
+ Device Operation
+
+When the driver requires random bytes, it places the descriptor
+of one or more buffers in the queue. It will be completely filled
+by random data by the device.
+
+Appendix G: Memory Balloon Device
+
+The virtio memory balloon device is a primitive device for
+managing guest memory: the device asks for a certain amount of
+memory, and the guest supplies it (or withdraws it, if the device
+has more than it asks for). This allows the guest to adapt to
+changes in allowance of underlying physical memory. If the
+feature is negotiated, the device can also be used to communicate
+guest memory statistics to the host.
+
+ Configuration
+
+ Subsystem Device ID 5
+
+ Virtqueues 0:inflateq. 1:deflateq. 2:statsq.[footnote:
+Only if VIRTIO_BALLON_F_STATS_VQ set
+]
+
+ Feature bits
+
+ VIRTIO_BALLOON_F_MUST_TELL_HOST (0) Host must be told before
+ pages from the balloon are used.
+
+ VIRTIO_BALLOON_F_STATS_VQ (1) A virtqueue for reporting guest
+ memory statistics is present.
+
+ Device configuration layout Both fields of this configuration
+ are always available. Note that they are little endian, despite
+ convention that device fields are guest endian:struct virtio_balloon_config {
+
+ u32 num_pages;
+
+ u32 actual;
+
+};
+
+ Device Initialization
+
+ The inflate and deflate virtqueues are identified.
+
+ If the VIRTIO_BALLOON_F_STATS_VQ feature bit is negotiated:
+
+ Identify the stats virtqueue.
+
+ Add one empty buffer to the stats virtqueue and notify the
+ host.
+
+Device operation begins immediately.
+
+ Device Operation
+
+ Memory Ballooning The device is driven by the receipt of a
+ configuration change interrupt.
+
+ The “num_pages” configuration field is examined. If this is
+ greater than the “actual” number of pages, memory must be given
+ to the balloon. If it is less than the “actual” number of
+ pages, memory may be taken back from the balloon for general
+ use.
+
+ To supply memory to the balloon (aka. inflate):
+
+ The driver constructs an array of addresses of unused memory
+ pages. These addresses are divided by 4096[footnote:
+This is historical, and independent of the guest page size
+] and the descriptor describing the resulting 32-bit array is
+ added to the inflateq.
+
+ To remove memory from the balloon (aka. deflate):
+
+ The driver constructs an array of addresses of memory pages it
+ has previously given to the balloon, as described above. This
+ descriptor is added to the deflateq.
+
+ If the VIRTIO_BALLOON_F_MUST_TELL_HOST feature is set, the
+ guest may not use these requested pages until that descriptor
+ in the deflateq has been used by the device.
+
+ Otherwise, the guest may begin to re-use pages previously given
+ to the balloon before the device has acknowledged their
+ withdrawl. [footnote:
+In this case, deflation advice is merely a courtesy
+]
+
+ In either case, once the device has completed the inflation or
+ deflation, the “actual” field of the configuration should be
+ updated to reflect the new number of pages in the balloon.[footnote:
+As updates to configuration space are not atomic, this field
+isn't particularly reliable, but can be used to diagnose buggy
+guests.
+]
+
+ Memory Statistics
+
+The stats virtqueue is atypical because communication is driven
+by the device (not the driver). The channel becomes active at
+driver initialization time when the driver adds an empty buffer
+and notifies the device. A request for memory statistics proceeds
+as follows:
+
+ The device pushes the buffer onto the used ring and sends an
+ interrupt.
+
+ The driver pops the used buffer and discards it.
+
+ The driver collects memory statistics and writes them into a
+ new buffer.
+
+ The driver adds the buffer to the virtqueue and notifies the
+ device.
+
+ The device pops the buffer (retaining it to initiate a
+ subsequent request) and consumes the statistics.
+
+ Memory Statistics Format Each statistic consists of a 16 bit
+ tag and a 64 bit value. Both quantities are represented in the
+ native endian of the guest. All statistics are optional and the
+ driver may choose which ones to supply. To guarantee backwards
+ compatibility, unsupported statistics should be omitted.
+
+ struct virtio_balloon_stat {
+
+#define VIRTIO_BALLOON_S_SWAP_IN 0
+
+#define VIRTIO_BALLOON_S_SWAP_OUT 1
+
+#define VIRTIO_BALLOON_S_MAJFLT 2
+
+#define VIRTIO_BALLOON_S_MINFLT 3
+
+#define VIRTIO_BALLOON_S_MEMFREE 4
+
+#define VIRTIO_BALLOON_S_MEMTOT 5
+
+ u16 tag;
+
+ u64 val;
+
+} __attribute__((packed));
+
+ Tags
+
+ VIRTIO_BALLOON_S_SWAP_IN The amount of memory that has been
+ swapped in (in bytes).
+
+ VIRTIO_BALLOON_S_SWAP_OUT The amount of memory that has been
+ swapped out to disk (in bytes).
+
+ VIRTIO_BALLOON_S_MAJFLT The number of major page faults that
+ have occurred.
+
+ VIRTIO_BALLOON_S_MINFLT The number of minor page faults that
+ have occurred.
+
+ VIRTIO_BALLOON_S_MEMFREE The amount of memory not being used
+ for any purpose (in bytes).
+
+ VIRTIO_BALLOON_S_MEMTOT The total amount of memory available
+ (in bytes).
+