admin-guide/RAS/main.rst

1 .. SPDX-License-Identifier: GPL-2.0
14 Reliability, Availability and Serviceability (RAS) is a concept used on
18   is the probability that a system will produce correct outputs.
24   is the probability that a system is operational at a given time
31   is the simplicity and speed with which a system can be repaired or
34   * Generally measured by Mean Time Between Repair (MTBR)
37 -------------
39 In order to reduce system downtime, a system should be capable of detecting
42 the system administrator to take the action of replacing a component before
43 it causes data loss or system downtime.
51   Self-Monitoring, Analysis and Reporting Technology (SMART).
54 to identify if the probability of hardware errors is increasing, and, on such
59 ---------------
61 Most mechanisms used on modern systems use technologies like Hamming
62 Codes that allow error correction when the number of errors on a bit packet
67 Also, sometimes an error occurs on a component that is not in use. For
72 * **Correctable Error (CE)** - the error detection mechanism detected and
74   Kernel mechanisms allow the system administrator to consider them as fatal.
76 * **Uncorrected Error (UE)** - the number of errors exceeded the error
77   correction threshold, and the system was unable to auto-correct.
79 * **Fatal Error** - when an UE happens on a critical component of the
80   system (for example, a piece of the Kernel got corrupted by an UE), the
83 * **Non-fatal Error** - when an UE happens on an unused component,
84   like a CPU in power down state or an unused memory bank, the system may
88   When an error happens on a userspace process, it is also possible to
91 The mechanism for handling non-fatal errors is usually complex and may
93 policy desired by the system administrator.
96 ------------------------------------
98 Just detecting a hardware flaw is usually not enough, as the system needs
108 DMI BIOS usually has a list of memory module labels, which can be obtained
109 using the ``dmidecode`` tool. For example, on a desktop machine, it shows::
117 		Locator: ChannelA-DIMM0
125 In the above example, a DDR4 SO-DIMM memory module is located at the
126 system's memory labeled as "BANK 0", as given by the *bank locator* field.
127 Note that, on such a system, the *total width* is equal to the
128 *data width*. It means that such a memory module doesn't have error
132 bank. In this example, from an older server, ``dmidecode`` shows::
150 There, the DDR3 RDIMM memory module is located at the system's memory labeled
152 memory module has 64 bits of *data width* and 72 bits of *total width*. So,
154 This kind of memory is called Error-correcting code memory (ECC memory).
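
The *total width* versus *data width* comparison described above can be
automated. The following is a minimal sketch, assuming the text comes from
``dmidecode -t memory`` and uses its usual ``Memory Device`` records with
``Total Width:`` and ``Data Width:`` fields. The helper names
``parse_memory_devices`` and ``ecc_bits`` are made up for this example.

.. code-block:: python

    import re


    def parse_memory_devices(dmidecode_text):
        """Split dmidecode-style text into one dict per "Memory Device" record."""
        devices = []
        current = None
        for line in dmidecode_text.splitlines():
            stripped = line.strip()
            if stripped == "Memory Device":
                current = {}
                devices.append(current)
            elif not stripped:
                # A blank line ends the current record.
                current = None
            elif current is not None and ":" in stripped:
                key, _, value = stripped.partition(":")
                current[key.strip()] = value.strip()
        return devices


    def ecc_bits(device):
        """Return total width minus data width in bits, or None if unknown."""
        widths = []
        for key in ("Total Width", "Data Width"):
            match = re.match(r"(\d+) bits", device.get(key, ""))
            if match is None:
                return None
            widths.append(int(match.group(1)))
        return widths[0] - widths[1]

A result of ``0`` suggests a module without error correction bits, while a
positive value (for example ``8`` for 72 bits of total width and 64 bits of
data width) suggests ECC memory. Running ``dmidecode`` itself usually
requires root privileges.
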
157 labels on their system's board to use exactly the same BIOS, meaning that
161 ----------
164 used for error correction. In the above example, a memory module has
173 on the memory modules.
176 ECC code used on write, producing a word with *data width* and a *syndrome*.
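
As a toy illustration of how a syndrome points at a flipped bit, the sketch
below implements a Hamming(7,4) code in Python. It is not what a memory
controller does: real ECC memory uses wider codes (in the example above,
8 check bits protecting 64 data bits), and the function names are made up
for this example.

.. code-block:: python

    def hamming74_encode(data):
        """Encode 4 data bits as the 7-bit codeword [p1, p2, d1, p3, d2, d3, d4]."""
        d1, d2, d3, d4 = data
        p1 = d1 ^ d2 ^ d4  # covers codeword positions 1, 3, 5, 7
        p2 = d1 ^ d3 ^ d4  # covers codeword positions 2, 3, 6, 7
        p3 = d2 ^ d3 ^ d4  # covers codeword positions 4, 5, 6, 7
        return [p1, p2, d1, p3, d2, d3, d4]


    def hamming74_syndrome(code):
        """Return 0 if all checks pass, else the 1-based position of a single flipped bit."""
        s1 = code[0] ^ code[2] ^ code[4] ^ code[6]
        s2 = code[1] ^ code[2] ^ code[5] ^ code[6]
        s3 = code[3] ^ code[4] ^ code[5] ^ code[6]
        return s1 | (s2 << 1) | (s3 << 2)


    def hamming74_correct(code):
        """Fix at most one flipped bit; return (data bits, syndrome)."""
        syndrome = hamming74_syndrome(code)
        fixed = list(code)
        if syndrome:
            fixed[syndrome - 1] ^= 1
        return [fixed[2], fixed[4], fixed[5], fixed[6]], syndrome

A zero syndrome corresponds to an error-free read. A non-zero syndrome on a
single-bit flip is the analogue of a Correctable Error. This toy code cannot
detect double-bit errors, which is why practical schemes add further check
bits.
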
184 The information about the CE/UE errors is stored on some special registers
186 either by BIOS, by some special CPUs or by Linux EDAC driver. On x86 64
190 .. [#f1] Please notice that several memory controllers allow operation in a
191   mode called "Lock-Step", where two memory modules are grouped together,
192   doing 128-bit reads/writes. That gives 16 bits for error correction, with
194   that, when an error happens, there's no way to know which memory module is
198   In such a mode, the same data is written to two memory modules. At read,
199   the system checks both memory modules, in order to check if both provide
200   identical data. In such a configuration, when an error happens, there's no
201   way to know which memory module is to blame. So, it has to blame both
202   memory modules (or 4 memory modules, if the system is also in Lock-step
208 EDAC - Error Detection And Correction
214    was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
218    When the subsystem was pushed upstream for the first time, on
222 -------
224 The ``edac`` kernel module's goal is to detect and report hardware errors
225 that occur within the computer system running under Linux.
228 ------
236 CE events only, the system can and will continue to operate as no data
241 and system panics.
244 -----------------------
249 This new device type allows for non-memory types of ECC hardware detectors
261 ----------------
267 There are several add-in adapters that do **not** follow the PCI specification
284 ----------
286 EDAC is composed of a "core" module (``edac_core.ko``) and several Memory
287 Controller (MC) driver modules. On a given system, the CORE is loaded
292 Thus, to "report" on what version a system is running, one must report
297 -------
302 hardware-specific modules and have the dependencies load the necessary
309 loads both the ``amd76x_edac.ko`` memory controller module and the
310 ``edac_core.ko`` core module.
314 ---------------
317 lives in the /sys/devices/system/edac directory.
322 	mc	memory controller(s) system
323 	pci	PCI control and status system
329 ----------------------------
332 are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
335 .. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
336   used to refer to a memory module, although there are other memory
337   packaging alternatives, like SO-DIMM, SIMM, etc. The UEFI
338   specification (Version 2.7) defines a memory module in the Common
345 typical value. Yet, the actual number of csrows depends on the layout of
346 a given motherboard, memory controller and memory module characteristics.
348 Dual channels allow for dual data length (e. g. 128 bits, on 64-bit systems)
350 for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
353 	+------------+-----------------------+
355 	+------------+-----------+-----------+
359 	+------------+-----------+-----------+
361 	+------------+-----------+-----------+
363 	+------------+-----------+-----------+
365 	+------------+-----------+-----------+
367 	+------------+-----------+-----------+
369 	+------------+-----------+-----------+
371 In the above example, there are 4 physical slots on the motherboard
374 	+---------+---------+
376 	+---------+---------+
378 	+---------+---------+
380 Labels for these slots are usually silk-screened on the motherboard.
382 channel 1. Notice that there are two csrows possible on a physical DIMM.
383 These csrows are allocated their csrow assignment based on the slot into
389 both csrow0 and csrow1 are populated. On the other hand, when 2 single
393 controllers don't have any logic to identify the memory module, see
398 ``/sys/devices/system/edac/mc``, each memory controller will be
404 		   |->mc0
405 		   |->mc1
406 		   |->mc2
414 		|->csrow0
415 		|->csrow2
416 		|->csrow3
421 order to have dual-channel mode be operational. Since both csrow2 and
429 -------------------
436 	Documentation/ABI/testing/sysfs-devices-edac
440 ----------------------------------
445 A typical EDAC system has the following structure under
446 ``/sys/devices/system/edac/``\ [#f6]_::
448 	/sys/devices/system/edac/
496 this ``X`` memory module:
498 - ``size`` - Total memory managed by this csrow attribute file
503 - ``dimm_ue_count`` - Uncorrectable Errors count attribute file
506 	errors that have occurred on this DIMM. If panic_on_ue is set
508 	will panic the system.
510 - ``dimm_ce_count`` - Correctable Errors count attribute file
513 	errors that have occurred on this DIMM. This count is very
516 	monitored for non-zero values, and such information reported
517 	to the system administrator.
519 - ``dimm_dev_type``  - Device type attribute file
522 	being utilized on this DIMM.
525 		- x1
526 		- x2
527 		- x4
528 		- x8
530 - ``dimm_edac_mode`` - EDAC Mode of operation attribute file
535 - ``dimm_label`` - memory module label control file
538 	to it. With this label in the module, when errors occur
539 	the output can provide the DIMM label in the system log.
549 - ``dimm_location`` - location of the memory module
552 	memory controller identifies the location of a memory module.
553 	Depending on the type of memory and memory controller, it
556 		- *csrow* and *channel* - used when the memory controller
557 		  doesn't identify a single DIMM - e. g. in ``rankX`` dir;
558 		- *branch*, *channel*, *slot* - typically used on FB-DIMM memory
560 		- *channel*, *slot* - used on Nehalem and newer Intel drivers.
562 - ``dimm_mem_type`` - Memory Type attribute file
565 	on this csrow. Normally, either buffered or unbuffered memory.
568 		- Registered-DDR
569 		- Unbuffered-DDR
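
The per-module attribute files listed above can be collected with a short
script. This is a minimal sketch, assuming the ``dimmX`` (or ``rankX``)
directories sit under ``/sys/devices/system/edac/mc/mcN/`` and that the
``rankX`` directories expose the same ``dimm_*`` file names. The function
names are made up for this example.

.. code-block:: python

    from pathlib import Path

    DIMM_ATTRS = ("dimm_label", "dimm_location", "dimm_ce_count", "dimm_ue_count")


    def read_dimm_counters(edac_mc_root="/sys/devices/system/edac/mc"):
        """Return one dict per mcN/dimmX (or mcN/rankX) directory found."""
        results = []
        root = Path(edac_mc_root)
        for pattern in ("mc*/dimm*", "mc*/rank*"):
            for entry in sorted(root.glob(pattern)):
                if not entry.is_dir():
                    continue
                info = {"path": str(entry)}
                for attr in DIMM_ATTRS:
                    attr_file = entry / attr
                    if attr_file.is_file():
                        text = attr_file.read_text().strip()
                        # Counters are integers; labels and locations stay text.
                        info[attr] = int(text) if attr.endswith("_count") else text
                results.append(info)
        return results


    def dimms_with_errors(dimms):
        """Keep only the entries whose CE or UE count is non-zero."""
        return [d for d in dimms
                if d.get("dimm_ce_count", 0) or d.get("dimm_ue_count", 0)]

Calling ``dimms_with_errors(read_dimm_counters())`` periodically is one way
to watch for non-zero ``dimm_ce_count`` values, as recommended above.
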
571 .. [#f5] On some systems, the memory controller doesn't have any logic
572 …to identify the memory module. On such systems, the directory is called ``rankX`` and works on a s…
573   On modern Intel memory controllers, the memory controller identifies the
574   memory modules directly. On such systems, the directory is called ``dimmX``.
581 ----------------------
584 directories. As this API doesn't work properly for Rambus, FB-DIMMs and
592 - ``ue_count`` - Total Uncorrectable Errors count attribute file
595 	errors that have occurred on this csrow. If panic_on_ue is set
597 	will panic the system.
600 - ``ce_count`` - Total Correctable Errors count attribute file
603 	errors that have occurred on this csrow. This count is very
606 	monitored for non-zero values, and such information reported
607 	to the system administrator.
610 - ``size_mb`` - Total memory managed by this csrow attribute file
616 - ``mem_type`` - Memory Type attribute file
619 	on this csrow. Normally, either buffered or unbuffered memory.
622 		- Registered-DDR
623 		- Unbuffered-DDR
626 - ``edac_mode`` - EDAC Mode of operation attribute file
632 - ``dev_type`` - Device type attribute file
635 	being utilized on this DIMM.
638 		- x1
639 		- x2
640 		- x4
641 		- x8
644 - ``ch0_ce_count`` - Channel 0 CE Count attribute file
646 	This attribute file will display the count of CEs on this
650 - ``ch0_ue_count`` - Channel 0 UE Count attribute file
652 	This attribute file will display the count of UEs on this
656 - ``ch0_dimm_label`` - Channel 0 DIMM Label control file
660 	to it. With this label in the module, when errors occur
661 	the output can provide the DIMM label in the system log.
672 - ``ch1_ce_count`` - Channel 1 CE Count attribute file
675 	This attribute file will display the count of CEs on this
679 - ``ch1_ue_count`` - Channel 1 UE Count attribute file
682 	This attribute file will display the count of UEs on this
686 - ``ch1_dimm_label`` - Channel 1 DIMM Label control file
689 	to it. With this label in the module, when errors occur
690 	the output can provide the DIMM label in the system log.
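
The per-channel files of a ``csrowX`` directory can be read and labeled
from a script. The sketch below is illustrative only: it takes the path of
a ``csrowX`` directory as an argument, and the function names are made up
for this example. Writing a label on a real system normally requires root.

.. code-block:: python

    import re
    from pathlib import Path


    def read_csrow_channels(csrow_dir):
        """Return {channel: {"ce_count": int, "ue_count": int, "dimm_label": str}}."""
        channels = {}
        for attr_file in sorted(Path(csrow_dir).glob("ch*_*")):
            match = re.fullmatch(r"ch(\d+)_(ce_count|ue_count|dimm_label)",
                                 attr_file.name)
            if match is None:
                continue
            index, kind = int(match.group(1)), match.group(2)
            text = attr_file.read_text().strip()
            channels.setdefault(index, {})[kind] = (
                text if kind == "dimm_label" else int(text))
        return channels


    def set_channel_label(csrow_dir, channel, label):
        """Write a label to chN_dimm_label; the control file must already exist."""
        control = Path(csrow_dir) / f"ch{channel}_dimm_label"
        if not control.is_file():
            raise FileNotFoundError(str(control))
        control.write_text(label + "\n")

A label written this way is the one that later shows up next to error
reports in the system log, so it is best set to match the silk-screen
label of the slot.
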
701 System Logging
702 --------------
704 If logging for UEs and CEs is enabled, then system logs will contain
713 	+---------------------------------------+-------------+
717 	+---------------------------------------+-------------+
719 	+---------------------------------------+-------------+
721 	+---------------------------------------+-------------+
723 	+---------------------------------------+-------------+
726 	+---------------------------------------+-------------+
728 	+---------------------------------------+-------------+
730 	+---------------------------------------+-------------+
732 	+---------------------------------------+-------------+
734 	+---------------------------------------+-------------+
735 	| And then an optional, driver-specific |             |
738 	+---------------------------------------+-------------+
741 type, a notice of "no info" and then an optional, driver-specific error
746 ------------------------
748 On Header Type 00 devices, the primary status is looked at for any
749 parity error regardless of whether parity is enabled on the device or
750 not. (The spec indicates parity is generated in some cases). On Header
752 if a parity error occurred on the bus on the other side of the bridge.
756 -------------------
758 Under ``/sys/devices/system/edac/pci`` are control and attribute files as
762 - ``check_pci_parity`` - Enable/Disable PCI Parity checking control file
770 		echo "1" >/sys/devices/system/edac/pci/check_pci_parity
774 		echo "0" >/sys/devices/system/edac/pci/check_pci_parity
777 - ``pci_parity_count`` - Parity Count
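
The two files above can also be driven from a script. This is a minimal
sketch, assuming ``pci_parity_count`` holds a plain decimal number. The
function names are made up for this example, and writing to
``check_pci_parity`` on a real system normally requires root.

.. code-block:: python

    from pathlib import Path

    EDAC_PCI = "/sys/devices/system/edac/pci"


    def set_pci_parity_check(enabled, edac_pci_root=EDAC_PCI):
        """Write "1" (enable) or "0" (disable) to check_pci_parity."""
        control = Path(edac_pci_root) / "check_pci_parity"
        control.write_text("1\n" if enabled else "0\n")


    def read_pci_parity_count(edac_pci_root=EDAC_PCI):
        """Return the current parity error count as an integer."""
        counter = Path(edac_pci_root) / "pci_parity_count"
        return int(counter.read_text().strip())


    def new_parity_errors(previous_count, edac_pci_root=EDAC_PCI):
        """Return (errors since previous_count, current count)."""
        current = read_pci_parity_count(edac_pci_root)
        return current - previous_count, current

Keeping the last value returned by ``new_parity_errors()`` and passing it
back on the next call gives the number of parity errors seen per interval.
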
783 Module parameters
784 -----------------
786 - ``edac_mc_panic_on_ue`` - Panic on UE control file
790 	occurs - it is indeterminate what was uncorrected and the operating
791 	system context might be so mangled that continuing will lead to further
797 		module/kernel parameter: edac_mc_panic_on_ue=[0|1]
801 		echo "1" > /sys/module/edac_core/parameters/edac_mc_panic_on_ue
804 - ``edac_mc_log_ue`` - Log UE control file
808 	are reported through the system message log.  UE statistics
813 		module/kernel parameter: edac_mc_log_ue=[0|1]
817 		echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ue
820 - ``edac_mc_log_ce`` - Log CE control file
824 	errors are reported through the system message log.
829 		module/kernel parameter: edac_mc_log_ce=[0|1]
833 		echo "1" > /sys/module/edac_core/parameters/edac_mc_log_ce
836 - ``edac_mc_poll_msec`` - Polling period control file
848 		module/kernel parameter: edac_mc_poll_msec=<milliseconds>
852 		echo "1000" > /sys/module/edac_core/parameters/edac_mc_poll_msec
855 - ``panic_on_pci_parity`` - Panic on PCI PARITY Error
862 	module/kernel parameter::
868 		echo "1" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
872 		echo "0" > /sys/module/edac_core/parameters/edac_panic_on_pci_pe
877 ----------------
884 At the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices
891 	/sys/devices/system/edac/test-instance
901 	panic_on_ue	boolean to ``panic`` the system if an UE is encountered
913 			One out-of-tree driver uses controls here to allow
921 ---------
926 	+----------------+
927 	| test-instance0 |
928 	+----------------+
940 ------
945 	+-------------+
946 	| test-block0 |
947 	+-------------+
962 	test-block-bits-0	for every POLL cycle this counter
964 	test-block-bits-1	every 10 cycles, this counter is bumped once,
965 				and test-block-bits-0 is set to 0
966 	test-block-bits-2	every 100 cycles, this counter is bumped once,
967 				and test-block-bits-1 is set to 0
968 	test-block-bits-3	every 1000 cycles, this counter is bumped once,
969 				and test-block-bits-2 is set to 0
974 	reset-counters		writing ANY thing to this control will
986 Usage of EDAC APIs on Nehalem and newer Intel CPUs
987 --------------------------------------------------
989 On older Intel architectures, the memory controller was part of the North
995 found on newer Intel CPUs, such as ``i7core_edac``, ``sb_edac`` and
1041    ``/sys/devices/system/edac/mc/mc?/``:
1043    - ``inject_addrmatch/*``:
1061 		echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
1062 		echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
1066 		echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
1067 		echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
1069    - ``inject_eccmask``:
1072    - ``inject_section``:
1079    - ``inject_type``:
1082 		bit 0 - repeat
1083 		bit 1 - ecc
1084 		bit 2 - parity
1086    - ``inject_enable``:
1091    Datasheet states that the error will only be generated after a write on an
1096    at socket 0, on any DIMM/address on channel 2::
1098 	echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
1099 	echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
1100 	echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
1101 	echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
1102 	echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
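
The same write sequence can be scripted. The sketch below mirrors the shell
commands above, with the ``inject_type`` bits spelled out as constants. The
function and constant names are made up for this example, and the
directory passed in is assumed to be an ``mcN`` directory that exposes the
``inject_*`` files.

.. code-block:: python

    from pathlib import Path

    # inject_type bit positions, as listed for the inject_type file.
    INJECT_REPEAT = 1 << 0
    INJECT_ECC = 1 << 1
    INJECT_PARITY = 1 << 2


    def inject_error(mc_dir, channel, inject_type, eccmask, section):
        """Program the injection files in order, writing inject_enable last."""
        mc = Path(mc_dir)
        writes = [
            ("inject_addrmatch/channel", channel),
            ("inject_type", inject_type),
            ("inject_eccmask", eccmask),
            ("inject_section", section),
            ("inject_enable", 1),
        ]
        for name, value in writes:
            (mc / name).write_text(f"{value}\n")
        return writes

With these names, the shell example corresponds to
``inject_error("/sys/devices/system/edac/mc/mc0", 2, INJECT_ECC, 64, 3)``.
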
1110 …EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, …
1115    uses those registers to report Corrected Errors on devices with Registered
1125      $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
1126 	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
1128 	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
1130 	/sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
1133    What happens here is that errors on different csrows, but at the same
1158 Reference documents used on ``amd64_edac``
1159 ------------------------------------------
1161 The ``amd64_edac`` module is based on the following documents
1162 (available from http://support.amd.com/en-us/search/tech-docs):
1185 	  Models 30h-3Fh Processors
1189    :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf
1192 	  Models 60h-6Fh Processors
1196    :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf
1199 	  Models 00h-0Fh Processors
1210   - 7 Dec 2005
1211   - 17 Jul 2007	Updated
1215   - 05 Aug 2009	Nehalem interface
1216   - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section
1220   - Doug Thompson, Dave Jiang, Dave Peterson et al,
1221   - Mauro Carvalho Chehab
1222   - Borislav Petkov
1223   - original author: Thayne Harbaugh