1 .. SPDX-License-Identifier: GPL-2.0
37 -------------
51 Self-Monitoring, Analysis and Reporting Technology (SMART).
59 ---------------
72 * **Correctable Error (CE)** - the error detection mechanism detected and
76 * **Uncorrected Error (UE)** - the number of errors exceeded the error
77 correction threshold, and the system was unable to auto-correct.
79 * **Fatal Error** - when a UE happens on a critical component of the
83 * **Non-fatal Error** - when a UE happens on an unused component,
91 The mechanism for handling non-fatal errors is usually complex and may
96 ------------------------------------
117 Locator: ChannelA-DIMM0
125 In the above example, a DDR4 SO-DIMM memory module is located at the
154 This kind of memory is called Error-correcting code memory (ECC memory).
161 ----------
186 either by BIOS, by some special CPUs or by the Linux EDAC driver. On x86 64
191 mode called "Lock-Step", where it groups two memory modules together,
192 doing 128-bit reads/writes. That gives 16 bits for error correction, with
202 memory modules (or 4 memory modules, if the system is also on Lock-step
208 EDAC - Error Detection And Correction
214 was "out-of-tree" and maintained at http://bluesmoke.sourceforge.net.
219 Kernel 2.6.16, it was renamed to ``EDAC``.
222 -------
224 The ``edac`` kernel module's goal is to detect and report hardware errors
228 ------
244 -----------------------
246 A new feature for EDAC, the ``edac_device`` class of device, was added in
249 This new device type allows for non-memory types of ECC hardware detectors
261 ----------------
267 There are several add-in adapters that do **not** follow the PCI specification
274 the EDAC PCI scanning code. If that attribute is set, PCI parity/error
284 ----------
286 EDAC is composed of a "core" module (``edac_core.ko``) and several Memory
287 Controller (MC) driver modules. On a given system, the CORE is loaded
288 and one MC driver will be loaded. Both the CORE and the MC driver (or
293 both the CORE's and the MC driver's versions.
297 -------
299 If ``edac`` was statically linked with the kernel then no loading
300 is necessary. If ``edac`` was built as modules then simply modprobe
301 the ``edac`` pieces that you need. You should be able to modprobe
302 hardware-specific modules and have the dependencies load the necessary
314 ---------------
316 EDAC presents a ``sysfs`` interface for control and reporting purposes. It
317 lives in the /sys/devices/system/edac directory.
322 mc memory controller(s) system
328 Memory Controller (mc) Model
329 ----------------------------
331 Each ``mc`` device controls a set of memory modules [#f4]_. These modules
332 are laid out in a Chip-Select Row (``csrowX``) and Channel table (``chX``).
335 .. [#f4] Nowadays, the term DIMM (Dual In-line Memory Module) is widely
337 packaging alternatives, like SO-DIMM, SIMM, etc. The UEFI
340 (Type 17). Throughout this document, and inside the EDAC subsystem, the term
350 for more than 2 channels, like Fully Buffered DIMMs (FB-DIMMs) memory
353 +------------+-----------------------+
355 +------------+-----------+-----------+
359 +------------+-----------+-----------+
361 +------------+-----------+-----------+
363 +------------+-----------+-----------+
365 +------------+-----------+-----------+
367 +------------+-----------+-----------+
369 +------------+-----------+-----------+
374 +---------+---------+
376 +---------+---------+
378 +---------+---------+
380 Labels for these slots are usually silk-screened on the motherboard.
397 tree in EDAC's sysfs interface. Starting in directory
398 ``/sys/devices/system/edac/mc``, each memory controller will be
400 index of the MC::
402 ..../edac/mc/
404 |->mc0
405 |->mc1
406 |->mc2
412 .../mc/mc0/
414 |->csrow0
415 |->csrow2
416 |->csrow3
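
The directory layout above can be walked with a few lines of shell. This is a minimal sketch, not part of the EDAC tooling: the function name ``edac_list_csrows`` and its optional root argument are hypothetical, and the argument exists only so the function can be pointed at a directory other than ``/sys/devices/system/edac/mc``.

```shell
#!/bin/sh
# Print each memory controller directory (mcX) and the csrowX
# directories found beneath it.
# Assumption: the optional first argument overrides the sysfs root.
edac_list_csrows() {
    mc_root=${1:-/sys/devices/system/edac/mc}
    for mc in "$mc_root"/mc[0-9]*; do
        [ -d "$mc" ] || continue          # glob did not match anything
        basename "$mc"
        for csrow in "$mc"/csrow[0-9]*; do
            [ -d "$csrow" ] || continue   # controller without csrow dirs
            echo "  $(basename "$csrow")"
        done
    done
}

edac_list_csrows
```

On a system without an EDAC driver loaded the function prints nothing. Gaps in the csrow numbering, as in the example tree, show up as missing entries.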
421 order to have dual-channel mode be operational. Since both csrow2 and
425 Within each of the ``mcX`` and ``csrowX`` directories are several EDAC
429 -------------------
431 In ``mcX`` directories are EDAC control and attribute files for
436 Documentation/ABI/testing/sysfs-devices-edac
440 ----------------------------------
442 The recommended way to use the EDAC subsystem is to look at the information
445 A typical EDAC system has the following structure under
446 ``/sys/devices/system/edac/``\ [#f6]_::
448 /sys/devices/system/edac/
449 ├── mc
495 In the ``dimmX`` directories are EDAC control and attribute files for
498 - ``size`` - Total memory managed by this csrow attribute file
503 - ``dimm_ue_count`` - Uncorrectable Errors count attribute file
507 this counter will not have a chance to increment, since EDAC
510 - ``dimm_ce_count`` - Correctable Errors count attribute file
516 monitored for non-zero values, and such information reported
519 - ``dimm_dev_type`` - Device type attribute file
525 - x1
526 - x2
527 - x4
528 - x8
530 - ``dimm_edac_mode`` - EDAC Mode of operation attribute file
535 - ``dimm_label`` - memory module label control file
549 - ``dimm_location`` - location of the memory module
556 - *csrow* and *channel* - used when the memory controller
557 doesn't identify a single DIMM - e.g. in ``rankX`` dir;
558 - *branch*, *channel*, *slot* - typically used on FB-DIMM memory
560 - *channel*, *slot* - used on Nehalem and newer Intel drivers.
562 - ``dimm_mem_type`` - Memory Type attribute file
568 - Registered-DDR
569 - Unbuffered-DDR
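
Monitoring ``dimm_ce_count`` and ``dimm_ue_count`` for non-zero values can be sketched as below. The function name ``edac_dimm_errors`` is hypothetical, and the sketch assumes that the ``dimmX`` directories sit directly under each ``mcX`` directory; the optional argument overrides the sysfs root.

```shell
#!/bin/sh
# Report every DIMM whose correctable (CE) or uncorrectable (UE)
# error counter is non-zero.
# Assumption: dimmX directories live at <root>/mcX/dimmX.
edac_dimm_errors() {
    mc_root=${1:-/sys/devices/system/edac/mc}
    for dimm in "$mc_root"/mc[0-9]*/dimm[0-9]*; do
        [ -r "$dimm/dimm_ce_count" ] || continue
        ce=$(cat "$dimm/dimm_ce_count")
        # A missing or unreadable UE counter is treated as zero.
        ue=$(cat "$dimm/dimm_ue_count" 2>/dev/null || echo 0)
        # A missing label is shown as "?".
        label=$(cat "$dimm/dimm_label" 2>/dev/null || echo "?")
        if [ "$ce" -ne 0 ] || [ "$ue" -ne 0 ]; then
            echo "${dimm#"$mc_root"/}: label=$label ce=$ce ue=$ue"
        fi
    done
}

edac_dimm_errors
```

Run periodically, for instance from a cron job, any output line is a candidate for a report to the system administrator.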
581 ----------------------
584 directories. As this API doesn't work properly for Rambus, FB-DIMMs and
588 In the ``csrowX`` directories are EDAC control and attribute files for
592 - ``ue_count`` - Total Uncorrectable Errors count attribute file
596 this counter will not have a chance to increment, since EDAC
600 - ``ce_count`` - Total Correctable Errors count attribute file
606 monitored for non-zero values, and such information reported
610 - ``size_mb`` - Total memory managed by this csrow attribute file
616 - ``mem_type`` - Memory Type attribute file
622 - Registered-DDR
623 - Unbuffered-DDR
626 - ``edac_mode`` - EDAC Mode of operation attribute file
632 - ``dev_type`` - Device type attribute file
638 - x1
639 - x2
640 - x4
641 - x8
644 - ``ch0_ce_count`` - Channel 0 CE Count attribute file
650 - ``ch0_ue_count`` - Channel 0 UE Count attribute file
656 - ``ch0_dimm_label`` - Channel 0 DIMM Label control file
672 - ``ch1_ce_count`` - Channel 1 CE Count attribute file
679 - ``ch1_ue_count`` - Channel 1 UE Count attribute file
686 - ``ch1_dimm_label`` - Channel 1 DIMM Label control file
702 --------------
707 …EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": amd76…
708 …EDAC MC0: CE page 0x1e5, offset 0xfb0, grain 8, syndrome 0xb741, row 0, channel 1 "DIMM_B1": amd76…
713 +---------------------------------------+-------------+
717 +---------------------------------------+-------------+
719 +---------------------------------------+-------------+
721 +---------------------------------------+-------------+
723 +---------------------------------------+-------------+
726 +---------------------------------------+-------------+
728 +---------------------------------------+-------------+
730 +---------------------------------------+-------------+
732 +---------------------------------------+-------------+
734 +---------------------------------------+-------------+
735 | And then an optional, driver-specific | |
738 +---------------------------------------+-------------+
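
The fields of a CE log line like the ones shown above can be extracted with ``sed``. This is a sketch only: the function name ``edac_parse_ce`` is hypothetical, the pattern is derived solely from the two sample lines, and the driver-specific tail that is truncated in the samples is ignored.

```shell
#!/bin/sh
# Read kernel log lines on stdin and print the fields of each
# "EDAC MCn: CE ..." message as key=value pairs.
# Lines that do not match the pattern are silently dropped.
edac_parse_ce() {
    sed -n 's/.*EDAC MC\([0-9][0-9]*\): CE page \(0x[0-9a-f]*\), offset \(0x[0-9a-f]*\), grain \([0-9]*\), syndrome \(0x[0-9a-f]*\), row \([0-9]*\), channel \([0-9]*\) "\([^"]*\)".*/mc=\1 page=\2 offset=\3 grain=\4 syndrome=\5 row=\6 channel=\7 label=\8/p'
}

# Sample taken from the visible part of the first log line above;
# the truncated driver-specific tail is left out.
printf '%s\n' 'EDAC MC0: CE page 0x283, offset 0xce0, grain 8, syndrome 0x6ec3, row 0, channel 1 "DIMM_B1": ' | edac_parse_ce
```

In practice the input would come from the kernel log, for example ``dmesg | edac_parse_ce``.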
741 type, a notice of "no info" and then an optional, driver-specific error
746 ------------------------
756 -------------------
758 Under ``/sys/devices/system/edac/pci`` are control and attribute files as
762 - ``check_pci_parity`` - Enable/Disable PCI Parity checking control file
770 echo "1" >/sys/devices/system/edac/pci/check_pci_parity
774 echo "0" >/sys/devices/system/edac/pci/check_pci_parity
777 - ``pci_parity_count`` - Parity Count
784 -----------------
786 - ``edac_mc_panic_on_ue`` - Panic on UE control file
790 occurs - it is indeterminate what was uncorrected and the operating
792 corruption. If the kernel has MCE configured, then EDAC will never
804 - ``edac_mc_log_ue`` - Log UE control file
820 - ``edac_mc_log_ce`` - Log CE control file
836 - ``edac_mc_poll_msec`` - Polling period control file
855 - ``panic_on_pci_parity`` - Panic on PCI PARITY Error
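
The current values of the ``edac_mc_*`` parameters listed above can be inspected with a short loop. This sketch makes an assumption that is not visible in the lines above: that the parameters are exposed under ``/sys/module/edac_core/parameters``. The function name ``edac_show_params`` is hypothetical, and ``panic_on_pci_parity`` is deliberately left out because its location is not stated here.

```shell
#!/bin/sh
# Print the value of each edac_mc_* control file.
# Assumption: the files live in /sys/module/edac_core/parameters;
# the optional first argument overrides that directory.
edac_show_params() {
    dir=${1:-/sys/module/edac_core/parameters}
    for name in edac_mc_panic_on_ue edac_mc_log_ue \
                edac_mc_log_ce edac_mc_poll_msec; do
        if [ -r "$dir/$name" ]; then
            echo "$name=$(cat "$dir/$name")"
        else
            echo "$name=<unavailable>"
        fi
    done
}

edac_show_params
```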
876 EDAC device type
877 ----------------
884 At the location ``/sys/devices/system/edac`` (sysfs) new edac_device devices
887 There is a three-level tree beneath the above ``edac`` directory. For example,
891 /sys/devices/system/edac/test-instance
913 One out-of-tree driver uses controls here to allow
921 ---------
926 +----------------+
927 | test-instance0 |
928 +----------------+
940 ------
945 +-------------+
946 | test-block0 |
947 +-------------+
962 test-block-bits-0 for every POLL cycle this counter
964 test-block-bits-1 every 10 cycles, this counter is bumped once,
965 and test-block-bits-0 is set to 0
966 test-block-bits-2 every 100 cycles, this counter is bumped once,
967 and test-block-bits-1 is set to 0
968 test-block-bits-3 every 1000 cycles, this counter is bumped once,
969 and test-block-bits-2 is set to 0
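
The cascade described above behaves like a decimal odometer. The following sketch simulates it in user space; it does not touch the driver, the function name ``simulate_test_block`` is hypothetical, and it assumes that ``test-block-bits-0`` is incremented by one on every poll cycle.

```shell
#!/bin/sh
# Simulate the four test-block-bits counters for a given number of
# poll cycles and print them as "bits-3 bits-2 bits-1 bits-0".
# Assumption: bits-0 goes up by one per poll cycle.
simulate_test_block() {
    cycles=$1
    b0=0 b1=0 b2=0 b3=0
    i=0
    while [ "$i" -lt "$cycles" ]; do
        i=$((i + 1))
        b0=$((b0 + 1))
        # Every 10 cycles: bump bits-1, reset bits-0.
        if [ "$b0" -eq 10 ]; then b0=0; b1=$((b1 + 1)); fi
        # Every 100 cycles: bump bits-2, reset bits-1.
        if [ "$b1" -eq 10 ]; then b1=0; b2=$((b2 + 1)); fi
        # Every 1000 cycles: bump bits-3, reset bits-2.
        if [ "$b2" -eq 10 ]; then b2=0; b3=$((b3 + 1)); fi
    done
    echo "$b3 $b2 $b1 $b0"
}

simulate_test_block 1234   # prints "1 2 3 4"
```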
974 reset-counters writing anything to this control will
983 http://bluesmoke.sourceforge.net project site for EDAC.
986 Usage of EDAC APIs on Nehalem and newer Intel CPUs
987 --------------------------------------------------
992 controller (MC) inside the CPUs.
1008 Each MC has 3 physical read channels, 3 physical write channels and
1013 As the minimum unit the EDAC API maps is the csrow, the driver sequentially
1037 2) The MC has the ability to inject errors to test drivers. The drivers
1041 ``/sys/devices/system/edac/mc/mc?/``:
1043 - ``inject_addrmatch/*``:
1061 echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
1062 echo 1 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
1066 echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/dimm
1067 echo any >/sys/devices/system/edac/mc/mc0/inject_addrmatch/rank
1069 - ``inject_eccmask``:
1072 - ``inject_section``:
1079 - ``inject_type``:
1082 bit 0 - repeat
1083 bit 1 - ecc
1084 bit 2 - parity
1086 - ``inject_enable``:
1098 echo 2 >/sys/devices/system/edac/mc/mc0/inject_addrmatch/channel
1099 echo 2 >/sys/devices/system/edac/mc/mc0/inject_type
1100 echo 64 >/sys/devices/system/edac/mc/mc0/inject_eccmask
1101 echo 3 >/sys/devices/system/edac/mc/mc0/inject_section
1102 echo 1 >/sys/devices/system/edac/mc/mc0/inject_enable
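
The value written to ``inject_type`` is a bit mask built from the three bits listed above. A small helper can compose it from flag names; the function name ``inject_type_mask`` is hypothetical and the helper only computes the number, it does not write to sysfs.

```shell
#!/bin/sh
# Compose an inject_type value from flag names.
#   repeat -> bit 0 (1), ecc -> bit 1 (2), parity -> bit 2 (4)
inject_type_mask() {
    mask=0
    for flag in "$@"; do
        case $flag in
            repeat) mask=$((mask | 1)) ;;
            ecc)    mask=$((mask | 2)) ;;
            parity) mask=$((mask | 4)) ;;
            *) echo "unknown inject_type flag: $flag" >&2; return 1 ;;
        esac
    done
    echo "$mask"
}

inject_type_mask ecc          # prints 2, the value used in the example above
inject_type_mask repeat ecc   # prints 3
```

The result could then be written to the control file as in the sequence above, for instance ``echo "$(inject_type_mask ecc)" >/sys/devices/system/edac/mc/mc0/inject_type``.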
1110 …EDAC MC0: UE row 0, channel-a= 0 channel-b= 0 labels "-": NON_FATAL (addr = 0x0075b980, socket=0, …
1125 $ for i in /sys/devices/system/edac/mc/mc0/all_channel_counts/*; do echo $i; cat $i; done
1126 /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm0
1128 /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm1
1130 /sys/devices/system/edac/mc/mc0/all_channel_counts/udimm2
1159 ------------------------------------------
1162 (available from http://support.amd.com/en-us/search/tech-docs):
1185 Models 30h-3Fh Processors
1189 :Link: http://support.amd.com/TechDocs/49125_15h_Models_30h-3Fh_BKDG.pdf
1192 Models 60h-6Fh Processors
1196 :Link: http://support.amd.com/TechDocs/50742_15h_Models_60h-6Fh_BKDG.pdf
1199 Models 00h-0Fh Processors
1210 - 7 Dec 2005
1211 - 17 Jul 2007 Updated
1215 - 05 Aug 2009 Nehalem interface
1216 - 26 Oct 2016 Converted to ReST and cleanups at the Nehalem section
1218 * EDAC authors/maintainers:
1220 - Doug Thompson, Dave Jiang, Dave Peterson et al,
1221 - Mauro Carvalho Chehab
1222 - Borislav Petkov
1223 - original author: Thayne Harbaugh