2

I've seen some worrying messages in dmesg lately.

Specifically, a bunch of:

[   19.367114] pcieport 0000:00:1c.5: PCIe Bus Error: severity=Corrected, type=Physical Layer, (Receiver ID)
[   19.367148] pcieport 0000:00:1c.5:   device [8086:9d15] error status/mask=00000081/00002000
[   19.367172] pcieport 0000:00:1c.5:    [ 0] Receiver Error         (First)
[   19.367192] pcieport 0000:00:1c.5:    [ 7] Bad DLLP    

And:

[   20.121489] ath10k_pci 0000:03:00.0: Unknown eventid: 118809
[   20.124485] ath10k_pci 0000:03:00.0: Unknown eventid: 90118

Or:

[   19.367213] pcieport 0000:00:1c.5: AER: Multiple Corrected error received: 0000:00:1c.5
[   19.367218] pcieport 0000:00:1c.5: can't find device of ID00e5

And most worryingly:

Nov 06 19:03:16 3c86-notebook kernel: ath10k_pci 0000:03:00.0: firmware crashed! (guid a62c787e-4709-4d94-a1a7-4e9357c2555a)
Nov 06 19:03:16 3c86-notebook kernel: ath10k_pci 0000:03:00.0: failed to get memcpy hi address for firmware address 4: -16
Nov 06 19:03:16 3c86-notebook kernel: ath10k_pci 0000:03:00.0: failed to read firmware dump area: -16

(This one happens roughly 50% of the time on boot)

All of these started appearing within a fairly short time (about 2 weeks). Since all of them could be caused by a hardware failure, I am quite worried. Is there a software way to test all or most of the hardware?

(Apart from the firmware crash, which causes the wifi to stop working, I have not seen any impact from the other errors.)

jasonwryan
Meowxiik
  • Could be the new kernel you have has issues with the old firmware you've installed. How did you install the firmware? – Fabby Nov 06 '18 at 18:32
  • I didn't install it manually; all the firmware I have came from the Arch Linux base packages. – Meowxiik Nov 06 '18 at 18:36

3 Answers

2

The most practical way of confirming it's hardware is to boot known-good software, for example an old kernel. Old firmware would be good too; a Live CD/DVD you know works would be great.

Also, check your logs — are you sure it only started 2 weeks ago? Or did you only start noticing it then?
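To answer the "when did it really start" question from the logs rather than from memory, something like the following could help. It is only a rough sketch: it assumes the log lines carry the same "Nov 06 19:03:16 host kernel:" prefix as the excerpt in the question, and the two pattern names (`aer_corrected`, `ath10k_crash`) are labels I made up for this example.

```python
import re
from collections import Counter

# Patterns taken from the messages quoted in the question.
# The dictionary keys are illustrative labels, not official names.
PATTERNS = {
    "aer_corrected": re.compile(r"PCIe Bus Error: severity=Corrected"),
    "ath10k_crash": re.compile(r"ath10k_pci .*firmware crashed"),
}

# Matches a syslog-style prefix such as "Nov 06 19:03:16 " and captures the day.
STAMP = re.compile(r"^(?P<day>[A-Z][a-z]{2} +\d{1,2}) \d{2}:\d{2}:\d{2} ")


def count_by_day(lines):
    """Return a Counter keyed by (day, pattern label)."""
    counts = Counter()
    for line in lines:
        stamp = STAMP.match(line)
        if not stamp:
            continue  # no syslog-style timestamp (e.g. raw dmesg output); skip
        day = " ".join(stamp.group("day").split())  # normalise "Nov  6" spacing
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                counts[(day, label)] += 1
    return counts
```

Feed it the lines of a saved log (for example text exported from the journal) and print the resulting counter. If the counts are non-zero for dates well before the two-week window, the problem is older than it looked.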

Also, at least if this is a desktop PC (relatively easy to open & look at), take a moment to do a visual inspection of the hardware: are all the fans spinning? Are there any missing heatsinks (and are the heatsinks free of dust/lint)? Any bulging capacitors? Since there are a bunch of PCIe errors, if you're comfortable with hardware, you could also reseat all the PCIe cards.

[Actual test equipment to prove the existence of a hardware fault would likely cost substantially more than just replacing the computer.]

derobert
1

The second and fourth sets of log messages are from the Atheros wireless drivers for your particular hardware. They could be caused by hardware issues, but they could also be caused by firmware problems. I've not dealt with stuff from this particular driver before, so I can't be much help on those.

The first and third sets are both from the PCI-e subsystem directly. Both are talking about corrected errors. I have dealt with these types of errors before, and I can say from experience that they almost always indicate a hardware issue of some sort (though it may not be bad hardware). The standard procedure I use when I come across this type of error is:

  • Double check that there are no missing heatsinks, that all the fans are running correctly, and that there is no dust buildup.
  • For each add-in card (not only the one showing the problems), remove the card and do the following (replacing the card if it fails at any point):
    • Inspect the contacts on the edge of the card for signs of corrosion or damage.
    • Inspect any electrolytic capacitors for signs of leakage.
    • Inspect any plastic cased components for signs of melting.
    • Inspect the whole board for burn marks, unusual discoloration, or other damage.
    • Verify that the board doesn't smell unusual, preferably shortly after it's been powered. An odd smell is usually indicative of leaking capacitors or overheated components, and will usually be present even if there is no visible indication of such problems.
    • Inspect the slot on the mainboard from which the card was removed, looking for evidence of bent contacts, corrosion, or melting (a good magnifying glass is useful for this).
  • Double check the mainboard itself just like for the cards. If it fails to pass inspection, replace it.
  • Verify that the power supply has a sufficiently high rating for the system, and that it's actually supplying correct voltages. You can do a quick check for an unloaded power-supply having the correct voltages with just a simple DC multimeter. Checking that the rails don't sag when the PSU is loaded is a bit trickier, but a lot of good motherboards will have voltage monitoring built-in that you can check from the firmware setup menus.
  • If you have access to a thermal camera (a real one, not the gimmicky smartphone apps that simulate one), check the inside of the system while it's running. No single spot should show a temperature above about 85 °C (this is the standard upper temperature limit for most consumer electronics).
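Without a thermal camera, the on-board sensors give at least a partial picture. The sketch below assumes a Linux system that exposes sensors under /sys/class/hwmon, where the temp*_input files report millidegrees Celsius. It only sees whatever sensors the drivers expose (typically CPU and a few board sensors), so it is no substitute for actually looking at the board. The function names and the reuse of the 85 °C figure as a threshold are my own choices for illustration.

```python
from pathlib import Path

LIMIT_C = 85.0  # threshold borrowed from the checklist above


def read_temps(hwmon_root="/sys/class/hwmon"):
    """Yield (label, degrees Celsius) for every temp*_input file found."""
    for path in sorted(Path(hwmon_root).glob("hwmon*/temp*_input")):
        try:
            millideg = int(path.read_text().strip())
        except (OSError, ValueError):
            continue  # sensor unreadable or not reporting a number; skip it
        try:
            chip = (path.parent / "name").read_text().strip()
        except OSError:
            chip = path.parent.name  # fall back to the directory name
        yield f"{chip}/{path.name}", millideg / 1000.0


def too_hot(hwmon_root="/sys/class/hwmon", limit=LIMIT_C):
    """Return the sensors currently reading above `limit` degrees Celsius."""
    return [(label, t) for label, t in read_temps(hwmon_root) if t > limit]
```

Calling `too_hot()` while the machine is under load and getting an empty list back is a weak but cheap sign that overheating is not the cause.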

On the plus side, you can be reasonably certain that the issue is specific to either the PCI express subsystem (and therefore is either a bad card or a bad mainboard), the power supply (though this is unlikely; if it were your power supply, you would probably be seeing other symptoms), or the firmware on the motherboard.
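For reference, the "error status/mask=00000081/00002000" line in the question can be decoded by hand. Here is a small sketch of that. Bits 0 and 7 are confirmed by the kernel's own output in the question; the remaining names are from my understanding of the AER Correctable Error Status register layout and are worth double-checking against the PCIe specification. The function names are hypothetical.

```python
import re

# Bit positions in the AER Correctable Error Status register.
# Bits 0 and 7 match the log in the question; treat the rest as an assumption.
CORRECTABLE_BITS = {
    0: "Receiver Error",
    6: "Bad TLP",
    7: "Bad DLLP",
    8: "REPLAY_NUM Rollover",
    12: "Replay Timer Timeout",
    13: "Advisory Non-Fatal Error",
    14: "Corrected Internal Error",
    15: "Header Log Overflow",
}


def parse_status_mask(line):
    """Extract (status, mask) as integers from an 'error status/mask=' line."""
    m = re.search(r"error status/mask=([0-9a-fA-F]{8})/([0-9a-fA-F]{8})", line)
    if m is None:
        return None
    return int(m.group(1), 16), int(m.group(2), 16)


def decode_correctable(status, mask=0):
    """Return [(bit, name, is_masked)] for every bit set in `status`."""
    decoded = []
    for bit in range(32):
        if status & (1 << bit):
            name = CORRECTABLE_BITS.get(bit, "unknown/reserved")
            decoded.append((bit, name, bool(mask & (1 << bit))))
    return decoded
```

For the values in the question, status 0x81 has bits 0 and 7 set, and mask 0x2000 covers only bit 13, so neither reported error is masked. Both are link-level errors, which is consistent with the inspection checklist above (contacts, slot, seating).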

0

Some things to try to diagnose the problem:

  • Load a different OS and see if the same errors pop up (software problem).
  • Try booting the old OS without some of the PCI cards (hardware).
  • Try rolling back the BIOS and see if the errors go away (firmware).

One of these will eliminate the error, and then you will know what part of your machine was having issues. Keeping a list of the errors and seeing which errors go away or stay with the different changes can help you diagnose if you have multiple problems or only one.

"This one happens roughly 50% of the time on boot"

That makes it seem like it could be a hardware problem to me. Try opening the box and checking for loose cards/cables. Cleaning up any dust or running your setup in a cooler environment will all have positive effects on your experience. Good luck!
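The "keep a list of the errors" bookkeeping can be done by hand, but a rough sketch of automating it might look like this. It assumes you save the kernel log once as a baseline and once after each change. The helper names are made up, and the signature is deliberately crude: it only strips the timestamp prefix, so messages whose numbers change between boots will show up as both "gone" and "new".

```python
import re

DMESG_PREFIX = re.compile(r"^\[\s*\d+\.\d+\]\s*")  # e.g. "[   19.367114] "
JOURNAL_PREFIX = re.compile(
    r"^[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2} \S+ kernel: "
)  # e.g. "Nov 06 19:03:16 host kernel: "


def signature(line):
    """Strip the timestamp prefix so the same message compares equal across boots."""
    line = DMESG_PREFIX.sub("", line)
    line = JOURNAL_PREFIX.sub("", line)
    return line.strip()


def compare(baseline_lines, trial_lines):
    """Classify messages as gone, still present, or new relative to the baseline."""
    base = {signature(l) for l in baseline_lines if l.strip()}
    trial = {signature(l) for l in trial_lines if l.strip()}
    return {
        "gone": sorted(base - trial),
        "still_present": sorted(base & trial),
        "new": sorted(trial - base),
    }
```

Run it once per experiment (different OS, card removed, BIOS rolled back) against the same baseline. Given the roughly 50% reproduction rate mentioned in the question, it is worth collecting several boots per experiment before concluding that an error is really gone.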