45

I'm seeing error messages like the ones below:

    Nov 15 15:49:52 x99 kernel: pcieport 0000:00:03.0: AER: Multiple Corrected error received: id=0018
    Nov 15 15:49:52 x99 kernel: pcieport 0000:00:03.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, id=0018(Receiver ID)
    Nov 15 15:49:52 x99 kernel: pcieport 0000:00:03.0: device [8086:6f08] error status/mask=00000040/00002000
    Nov 15 15:49:52 x99 kernel: pcieport 0000:00:03.0: [ 6] Bad TLP

These errors will cause degraded performance even though they have (so far) been corrected. Obviously, this issue needs to be resolved. However, I cannot find much about it on the Internet. (Maybe I'm looking in the wrong places.) I found only a few links, which I will post below.

Does anyone know more about these errors?

Is it the motherboard, the Samsung 950 Pro, or the GPU (or some combination of these)?

The hardware is:

  • Asus X99 Deluxe II motherboard
  • Samsung 950 Pro NVMe in the M.2 slot on the motherboard (which shares PCIe port 3); nothing else is plugged into PCIe port 3
  • GeForce GTX 1070 in PCIe slot 1
  • Core i7 6850K CPU

A couple of the links I found mention the same hardware (X99 Deluxe II motherboard and Samsung 950 Pro). I'm running Arch Linux.

I cannot find the string "8086:6f08" in journalctl or anywhere else I have thought to search so far.
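
For what it's worth, "8086:6f08" looks like a PCI vendor:device ID rather than free text in the logs, which would explain why journalctl doesn't find it. lspci can resolve such an ID directly:

    # Resolve the vendor:device ID from the AER messages to a device name
    lspci -nn -d 8086:6f08

Per the log above, the ID belongs to the root port at 0000:00:03.0 itself, not to the SSD or the GPU.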

odd error message with nvme ssd (Bad TLP) : linuxquestions https://www.reddit.com/r/linuxquestions/comments/4walnu/odd_error_message_with_nvme_ssd_bad_tlp/

PCIe: Is your card silently struggling with TLP retransmits? http://billauer.co.il/blog/2011/07/pcie-tlp-dllp-retransmit-data-link-layer-error/

GTX 1080 Throwing Bad TLP PCIe Bus Errors - GeForce Forums https://forums.geforce.com/default/topic/957456/gtx-1080-throwing-bad-tlp-pcie-bus-errors/

drivers - PCIe error in dmesg log - Ask Ubuntu https://askubuntu.com/questions/643952/pcie-error-in-dmesg-log

780Ti X99 hard lock - PCIE errors - NVIDIA Developer Forums https://devtalk.nvidia.com/default/topic/779994/linux/780ti-x99-hard-lock-pcie-errors/

MountainX
  • 17,948
  • I moved my GTX 710 from the PCIe x16 slot to an x1 slot (Asus Prime B450-Plus, Ryzen 5 3600, Samsung NVMe 970) – trants Sep 27 '19 at 03:47
  • 1
    Useful resources related to this - https://bugzilla.redhat.com/show_bug.cgi?id=1616364 & https://www.kernel.org/doc/html/latest/admin-guide/kernel-parameters.html. – slm Oct 12 '20 at 14:21

7 Answers

56

I can give at least a few details, even though I cannot fully explain what happens.

As described, for example, here, the CPU communicates with the PCIe bus controller using transaction layer packets (TLPs). The hardware detects faulty packets, and the Linux kernel reports them as these messages.
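
For reference, recent kernels (4.17 and later) also export per-device AER counters through sysfs, so you can check whether the corrected-error count keeps growing. The bus address below is the root port from the question's log; adjust it for your system:

    # Per-device counts of corrected AER errors (RxErr, BadTLP, ...)
    cat /sys/bus/pci/devices/0000:00:03.0/aer_dev_correctable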

The kernel option pci=nommconf disables Memory-Mapped PCI Configuration Space (MMCONFIG), which has been available in Linux since kernel 2.6. Very roughly, every PCI device has a configuration area that describes the device (which you can see with lspci -vv). The original method of accessing this area goes through I/O ports, while PCIe allows the space to be mapped into memory for simpler access.
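
As a rough illustration, you can dump such a configuration area through sysfs; the kernel transparently uses whichever access method is currently active (the bus address is again the root port from the question's log):

    # Dump the first 64 bytes of the root port's configuration space;
    # the kernel handles port-I/O vs. memory-mapped access internally
    sudo hexdump -C /sys/bus/pci/devices/0000:00:03.0/config | head -4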

That means that in this particular case, something goes wrong when the PCIe controller uses this method to access the configuration space of a particular device. It may be a hardware bug in the device, in the PCIe root controller on the motherboard, in the specific interaction of those two, or something else.

With pci=nommconf, the configuration space of all devices is accessed in the original way, and changing the access method works around the problem. So, if you like, it's both resolving and suppressing the issue.
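
If you want to verify which method the kernel ended up using after setting the option, the boot log records it (the exact wording may vary between kernel versions):

    cat /proc/cmdline                                # should now contain pci=nommconf
    dmesg | grep -i 'PCI: Using configuration type'  # reports the access method chosen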

dirkt
  • 32,309
  • How can I tell whether this is a motherboard problem or a CPU problem? Should I replace them? – user10024395 Jun 14 '17 at 13:52
  • @user2675516: It's not CPU related. It's a problem of the PCIe root controller (which often is in the Southbridge) and/or the PCIe controller of the device, or their interaction. Yes, changing the motherboard for one with different hardware usually gets rid of it. – dirkt Jun 14 '17 at 15:45
  • I changed from an Asus E-WS to an Asus Deluxe, but the problem still persists. That's why I suspect the CPU. Or is it because both are X99 chipset? – user10024395 Jun 14 '17 at 15:47
  • 2
    @user2675516: If the chipset is the same, esp. the PCIe controller, then changing the motherboard of course won't help. That's why I wrote "motherboard with different hardware". – dirkt Jun 14 '17 at 16:02
  • the common factor for me seems to be a motherboard with the X99 chipset – MountainX Jul 04 '17 at 03:18
  • Or, maybe the common factor for me is all Asus motherboards... – MountainX Nov 25 '17 at 00:39
  • 1
    This worked the second time I tried it. I might have had some driver problems; my advice is that if it doesn't work at first, check that everything is up to date, then redo it. – will Oct 19 '21 at 23:23
  • 1
  • Thanks, @dirkt. Your pci=nommconf immediately cleared up my identical problem on an HP laptop model 17-bs019dx (https://support.hp.com/us-en/document/c05531010). Networking had been working flawlessly on its pre-installed Windows, and under the dual-booting Slackware 14.2x64 that I'd originally installed "alongside" it. But when I replaced 14.2 with Slackware-current64 just the other day, it started throwing all those error messages as soon as dhcpcd started. I couldn't figure out what to do until Google coughed up your answer to the problem. – eigengrau Jun 24 '22 at 05:30
10

I get the same errors (Bad TLP associated with device 8086:6f08). I have an X99 Deluxe II, a Samsung 960 Pro, and an Nvidia 1080 Ti. These problems seem to be associated with the X99 chipset and an M.2 device such as the Samsung Pro.

The X99 Deluxe II motherboard shares bandwidth between the PCIE16_3 slot and M.2/U.2. Following the comment from @Nic, in the BIOS I changed Onboard Devices Configuration | U.2_2 Bandwidth from Auto to U.2_2. This fixed the problem for me.
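
If you want to confirm what the link negotiated after a BIOS change like this, lspci can show the current speed and width. The address 01:00.0 below is only a placeholder; use the bus address of your NVMe drive as shown by plain lspci:

    # Negotiated PCIe link speed/width for a device
    sudo lspci -vv -s 01:00.0 | grep -i 'LnkSta:'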

  • How did you determine that it is just that chipset? Tried every other chipset? It occurs on a wide variety of hardware. – doug65536 Sep 25 '19 at 03:51
  • @user1759557, thanks for the hint! To fix this for the mainboard TUF Z370-PLUS, I went to the system firmware (AKA "BIOS") configuration: Advanced menu > Onboard Devices Configuration > M.2_1 Configuration, and switched it from [Auto] to [PCIE mode] (I had a WD Black SN850 M.2 NVMe SSD installed). Here is the manual page for reference: https://dlcdnets.asus.com/pub/ASUS/mb/LGA1151/TUF_Z370-PLUS_GAMING/E13474_TUF_Z370_PLUS_GAMING_UM_v2_web.pdf#page=75 – saulius2 Dec 13 '22 at 18:35
  • PS. OK, I got it slightly wrong: the error occurs from time to time, but it's quite rare – 10 occurrences in an hour (under pretty heavy I/O load directed through M.2). – saulius2 Dec 13 '22 at 19:27
8

Adding the kernel command line option pci=nommconf resolved the issue for me. Therefore, I assume the issue is motherboard-related. It happens on all my X99 motherboard-equipped computers. It does not happen on Z170 systems or any other hardware I own.

MountainX
  • 17,948
4

I had a similar experience with an Nvidia RTX 2070 and a ROG STRIX B450-F GAMING motherboard. I solved it by configuring, in the BIOS, the specific PCIe generation supported by the Nvidia card (Gen 2), and by adding this option to the kernel boot parameters:

    pcie_aspm=off

as reported on this site

Changing the PCIe generation didn't really solve it on its own, but I think it could avoid problems.
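
To see whether ASPM is actually in effect before and after adding the option, you can check the kernel's active policy and the per-link state, for example:

    # Active ASPM policy (the bracketed entry is the one in use)
    cat /sys/module/pcie_aspm/parameters/policy
    # ASPM control state negotiated on each link
    sudo lspci -vv | grep -E 'LnkCtl:.*ASPM'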

Zioalex
  • 286
  • Hi @Alex, yes, this is true. I had a similar problem; nommconf wasn't enough for me. When I use the proprietary NVIDIA driver I need to set pcie_aspm=off to get rid of the errors; on a desktop this is no problem because it's not running on battery. Using pci=noaer will merely suppress the warnings, which could be an option if pcie_aspm=off does not work. On a laptop I would probably not advise disabling ASPM. The problem can be circumvented by using the nouveau driver if the problematic device has been identified as the NVIDIA GPU via lspci -vt and dmesg. – stephanmg May 27 '21 at 10:58
  • Addition: also, by using pcie_aspm=off instead of simply suppressing the warnings via pci=noaer, I have the impression that my system got a lot more responsive, probably because there is no more message spam on the bus... Just a guess. – stephanmg May 27 '21 at 11:02
3

I changed the PCIE16_3 slot configuration in the BIOS on my X99-E to be statically set to x8 mode instead of Auto, which is the default for M.2 device support. It now works fine, without TLP errors, on both of my GTX 1070 cards connected via PCIe x1-to-x16 extension boards.

I did not use port 16_3 at first; I moved to that slot to test, but I still had issues before the change in the BIOS. I also changed the bsleep setting for all cards to 30 in the miner config.

Before the change, the kernel log was spammed with faults. I also tried to power-cycle the system before and after the change; the problem seems to be pretty persistent.

slm
  • 369,824
Nic
  • 31
2

Search your motherboard manual for "AER". You can kill the source of the problem either by correcting the specific incompatibility or by disabling AER altogether. Only do this if all the error spam concerns corrected errors; otherwise you could be covering up an actual issue.
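
One way to check that, based on the message format shown in the question, is to tally the severities that have been logged:

    # Count logged PCIe error severities; anything other than "Corrected"
    # deserves investigation before you turn AER off
    journalctl -k | grep -o 'severity=[^,]*' | sort | uniq -c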

sebasth
  • 14,872
N3V3N
  • 29
1

Try these steps:

  1. cp /etc/default/grub ~/Desktop
  2. Edit the copied grub file. Add pci=noaer at the end of GRUB_CMDLINE_LINUX_DEFAULT. The line will look like this:

    GRUB_CMDLINE_LINUX_DEFAULT="quiet splash pci=noaer"
    
  3. sudo cp ~/Desktop/grub /etc/default/

  4. sudo update-grub
  5. Reboot now
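
After rebooting, you can confirm that the option actually made it onto the running kernel's command line:

    cat /proc/cmdline    # should now include pci=noaer
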
slm
  • 369,824
Ehtesham
  • 142
  • I applied your solution but instead of pci=noaer I used pci=nommconf as suggested by @dirkt – Megidd Jun 11 '18 at 05:39
  • Thanks, pci=noaer fixed my Slackware 14.2x64 problem on an HP laptop (a desktop install didn't exhibit this problem at all). – John Forkosh Jun 26 '18 at 21:38
  • 10
    Would you mind elaborating a bit? What does this option do and how do you expect it to solve the problem? – Calimo Nov 30 '18 at 13:31
  • 1
    Why would you not just use sudoedit for safe editing? -1; these copy-here-and-there steps are complete nonsense. – Vlastimil Burián Feb 23 '19 at 20:07
  • 23
    pci=noaer just disables the Advanced Error Reporting. So you still have those errors, you just don't see them ... – dirkt Sep 23 '19 at 13:07