Updated k10temp stopped with R7 3700 and kernel 4.18.0-348.2.1.el8_5.x86_64

Issues related to hardware problems
swallowtail
Posts: 112
Joined: 2009/04/18 04:48:27

Updated k10temp stopped with R7 3700 and kernel 4.18.0-348.2.1.el8_5.x86_64

Post by swallowtail » 2021/11/28 02:22:48

EDIT: also does not work with stock k10temp, not just groeck's version.

I've had CentOS 8 running for about 18 months on a Ryzen 7 3700. I got temperature monitoring working using k10temp from https://github.com/groeck/k10temp, and it has been reporting Tctl and Tdie without problems. Last week I updated to kernel 4.18.0-348.2.1.el8_5.x86_64 and it stopped working: both sensors now report 0.0°C, and the log fills with kernel traces.

Is anybody else in a similar boat, or has anyone fixed this or come up with a workaround? I've reported it on GitHub, but it's unlikely they will change it.

I don't know what in the latest kernel update has broken it :(

Nov 27 09:13:31 emp80 kernel: ------------[ cut here ]------------
Nov 27 09:13:31 emp80 kernel: Unable to find AMD Northbridge id for 0000:00:18.3
Nov 27 09:13:31 emp80 kernel: WARNING: CPU: 10 PID: 869830 at arch/x86/include/asm/amd_nb.h:100 read_tempreg_nb_f17+0x70/0x90 [k10temp]
Nov 27 09:13:31 emp80 kernel: Modules linked in: k10temp(OE) hwmon_vid vfat fat nft_counter ip_set_hash_net dm_mod vhost_net vhost vhost_iotlb tap tun bridge stp llc xt_set ip_set nft_compat nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables nfnetlink sunrpc ext4 mbcache jbd2 intel_rapl_msr intel_rapl_common edac_mce_amd kvm_amd kvm irqbypass raid456 crct10dif_pclmul async_raid6_recov crc32_pclmul async_memcpy async_pq async_xor ghash_clmulni_intel xor async_tx snd_hda_codec_hdmi snd_hda_intel snd_intel_dspcfg snd_intel_sdw_acpi snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm eeepc_wmi asus_wmi snd_timer sparse_keymap raid6_pq rapl snd rfkill pcspkr wmi_bmof i2c_piix4 ccp soundcore gpio_amdpt gpio_generic acpi_cpufreq xfs libcrc32c raid1 sd_mod sg nouveau drm_ttm_helper ttm video drm_kms_helper syscopyarea nvme sysfillrect ahci sysimgblt fb_sys_fops libahci drm crc32c_intel igb mxm_wmi libata nvme_core t10_pi dca i2c_algo_bit fuse asus_wmi_sensors(OE) wmi
Nov 27 09:13:31 emp80 kernel: [last unloaded: i2c_dev]
Nov 27 09:13:31 emp80 kernel: CPU: 10 PID: 869830 Comm: sensors Kdump: loaded Tainted: G        W  OE    --------- -  - 4.18.0-348.2.1.el8_5.x86_64 #1
Nov 27 09:13:31 emp80 kernel: Hardware name: System manufacturer System Product Name/ROG STRIX B450-F GAMING, BIOS 3103 06/17/2020
Nov 27 09:13:31 emp80 kernel: RIP: 0010:read_tempreg_nb_f17+0x70/0x90 [k10temp]
Nov 27 09:13:31 emp80 kernel: Code: 01 75 ca 8b 45 38 33 42 38 a8 f8 75 c0 0f b7 fb eb 1c 48 8b b5 20 01 00 00 48 85 f6 74 21 48 c7 c7 a0 01 50 c0 e8 59 cd fe f0 <0f> 0b 31 ff 5b 4c 89 e2 5d be 00 98 05 00 41 5c e9 ab ba f6 f0 48
Nov 27 09:13:31 emp80 kernel: RSP: 0018:ffffb39d473abdf8 EFLAGS: 00010282
Nov 27 09:13:31 emp80 kernel: RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000007
Nov 27 09:13:31 emp80 kernel: RDX: 0000000000000007 RSI: 0000000000000082 RDI: ffff9ed19ec96810
Nov 27 09:13:31 emp80 kernel: RBP: ffff9ec2c74c6000 R08: 00000000000c6340 R09: 00000000ffffffff
Nov 27 09:13:31 emp80 kernel: R10: 0000000000000001 R11: 0000000000000000 R12: ffffb39d473abe1c
Nov 27 09:13:31 emp80 kernel: R13: 0000000000000001 R14: ffff9ec2c9a2f600 R15: ffff9ec31085aa00
Nov 27 09:13:31 emp80 kernel: FS:  00007f930418f740(0000) GS:ffff9ed19ec80000(0000) knlGS:0000000000000000
Nov 27 09:13:31 emp80 kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Nov 27 09:13:31 emp80 kernel: CR2: 00007f93034172a0 CR3: 00000001ceadc000 CR4: 0000000000350ee0
Nov 27 09:13:31 emp80 kernel: Call Trace:
Nov 27 09:13:31 emp80 kernel: temp2_input_show+0x36/0x80 [k10temp]
Nov 27 09:13:31 emp80 kernel: dev_attr_show+0x1c/0x40
Nov 27 09:13:31 emp80 kernel: sysfs_kf_seq_show+0x9b/0x100
Nov 27 09:13:31 emp80 kernel: seq_read+0x163/0x420
Nov 27 09:13:31 emp80 kernel: vfs_read+0x91/0x140
Nov 27 09:13:31 emp80 kernel: ksys_read+0x4f/0xb0
Nov 27 09:13:31 emp80 kernel: do_syscall_64+0x5b/0x1a0
Nov 27 09:13:31 emp80 kernel: entry_SYSCALL_64_after_hwframe+0x65/0xca
Nov 27 09:13:31 emp80 kernel: RIP: 0033:0x7f9303a895a5
Nov 27 09:13:31 emp80 kernel: Code: fe ff ff 50 48 8d 3d 92 f7 09 00 e8 85 fe 01 00 0f 1f 44 00 00 f3 0f 1e fa 48 8d 05 f5 6f 2d 00 8b 00 85 c0 75 0f 31 c0 0f 05 <48> 3d 00 f0 ff ff 77 53 c3 66 90 41 54 49 89 d4 55 48 89 f5 53 89
Nov 27 09:13:31 emp80 kernel: RSP: 002b:00007fff8b131cd8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
Nov 27 09:13:31 emp80 kernel: RAX: ffffffffffffffda RBX: 000055a193e6a2c0 RCX: 00007f9303a895a5
Nov 27 09:13:31 emp80 kernel: RDX: 0000000000001000 RSI: 000055a193e6ccc0 RDI: 0000000000000003
Nov 27 09:13:31 emp80 kernel: RBP: 0000000000000d68 R08: 000055a193e6ccb0 R09: 0000000000000003
Nov 27 09:13:31 emp80 kernel: R10: 0000000000000001 R11: 0000000000000246 R12: 00007f9303d57880
Nov 27 09:13:31 emp80 kernel: R13: 00007f9303d583c0 R14: 0000000000000000 R15: 0000000000000000
Nov 27 09:13:31 emp80 kernel: ---[ end trace 82b0bfbf9b4ed035 ]---
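
In case anyone wants to check their own logs for the same failure, here is a rough Python sketch that pulls the PCI device and the warning site out of lines like the ones above. The function name and the regexes are illustrative guesses based on this one trace; the kernel does not promise a stable message format.

```python
import re

# The message printed just before the WARNING in the trace above.
NB_MSG = re.compile(r"Unable to find AMD Northbridge id for (\S+)")
# "WARNING: CPU: n PID: n at <file:line> <func>+0x../0x.. [<module>]"
WARN_MSG = re.compile(
    r"WARNING: CPU: \d+ PID: \d+ at (\S+) ([\w.]+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]"
)

def summarize_k10temp_traces(lines):
    """Count northbridge lookup failures per PCI device and collect
    the (file:line, function) sites of warnings raised inside k10temp."""
    devices = {}
    sites = set()
    for line in lines:
        m = NB_MSG.search(line)
        if m:
            devices[m.group(1)] = devices.get(m.group(1), 0) + 1
        m = WARN_MSG.search(line)
        if m and m.group(3) == "k10temp":
            sites.add((m.group(1), m.group(2)))
    return devices, sites

# Two lines copied from the trace in this post.
SAMPLE = [
    "Nov 27 09:13:31 emp80 kernel: Unable to find AMD Northbridge id for 0000:00:18.3",
    "Nov 27 09:13:31 emp80 kernel: WARNING: CPU: 10 PID: 869830 at "
    "arch/x86/include/asm/amd_nb.h:100 read_tempreg_nb_f17+0x70/0x90 [k10temp]",
]

devices, sites = summarize_k10temp_traces(SAMPLE)
print(devices)
print(sites)
```

To run it on a real log, feed it the lines of /var/log/messages instead of SAMPLE.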
Last edited by swallowtail on 2021/11/28 04:01:21, edited 1 time in total.

TrevorH
Site Admin
Posts: 33191
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: Updated k10temp stopped with R7 3700 and kernel 4.18.0-348.2.1.el8_5.x86_64

Post by TrevorH » 2021/11/28 02:54:05

Do you rebuild the kernel module for every update?
The future appears to be RHEL or Debian. I think I'm going Debian.
Info for USB installs on http://wiki.centos.org/HowTos/InstallFromUSBkey
CentOS 5 and 6 are deadest, do not use them.
Use the FAQ Luke

swallowtail
Posts: 112
Joined: 2009/04/18 04:48:27

Re: Updated k10temp stopped with R7 3700 and kernel 4.18.0-348.2.1.el8_5.x86_64

Post by swallowtail » 2021/11/28 03:42:21

Yes. My normal process after every kernel update is to re-run make / make install / modprobe k10temp from groeck's k10temp GitHub repository, which "normally" restores the Tdie and Tctl output in sensors that a kernel upgrade loses.

More troubleshooting after chatting with groeck at https://github.com/groeck/k10temp/issues/14:

- I reverted to the standard CentOS k10temp.ko.xz hwmon driver (copied from another updated CentOS system running the same kernel, then re-ran depmod)
- 'modinfo k10temp' now shows the stock CentOS-signed k10temp driver
- after 'modprobe k10temp', sensors now shows results (but incorrect ones) and /var/log/messages fills up with kernel traces again

k10temp-pci-00c3
Adapter: PCI adapter
Tctl:          +0.0°C
Tdie:          +0.0°C
Tccd1:        -49.0°C
Tccd2:        -49.0°C
Tccd3:        -49.0°C
Tccd4:        -49.0°C
Tccd5:        -49.0°C
Tccd6:        -49.0°C
Tccd7:        -49.0°C
Tccd8:        -49.0°C
So both stock and groeck's k10temp modules fail for me.
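
For anyone comparing against their own machine, a small Python sketch like this can parse sensors output and flag readings that cannot be right for a running desktop CPU. The 5°C floor is an arbitrary assumption, and the function names are made up for illustration.

```python
import re

# Matches lines such as "Tctl:          +0.0°C"
READING = re.compile(r"^(\w+):\s+([+-]?\d+(?:\.\d+)?)\s*°C")

def parse_sensors(text):
    """Return {label: degrees C} for every temperature line in the text."""
    readings = {}
    for line in text.splitlines():
        m = READING.match(line.strip())
        if m:
            readings[m.group(1)] = float(m.group(2))
    return readings

def suspicious(readings, floor=5.0):
    """Labels whose value is below the (assumed) plausibility floor."""
    return sorted(label for label, value in readings.items() if value < floor)

# Output copied from this post.
SAMPLE = """\
k10temp-pci-00c3
Adapter: PCI adapter
Tctl:          +0.0°C
Tdie:          +0.0°C
Tccd1:        -49.0°C
Tccd2:        -49.0°C
Tccd3:        -49.0°C
Tccd4:        -49.0°C
Tccd5:        -49.0°C
Tccd6:        -49.0°C
Tccd7:        -49.0°C
Tccd8:        -49.0°C
"""

readings = parse_sensors(SAMPLE)
print(suspicious(readings))
```

With the output above, every one of the ten readings falls below the floor, so none of them can be trusted.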

swallowtail
Posts: 112
Joined: 2009/04/18 04:48:27

Re: Updated k10temp stopped with R7 3700 and kernel 4.18.0-348.2.1.el8_5.x86_64

Post by swallowtail » 2021/11/28 04:23:14

OK, so I'm back to the CentOS-signed k10temp as noted. I blacklisted it in /etc/modprobe.d and rebooted. The system comes back clean as expected, with no errors, and sensors shows no k10temp output.

If I modprobe k10temp and run sensors, I immediately get pages of kernel traces, and the sensors output shows Tdie at 0°C, Tctl at 0°C, and all eight Tccd readings at -49°C.
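
Those particular numbers may be a clue. Going by the upstream drivers/hwmon/k10temp.c, a raw register value of zero would decode to exactly 0°C for Tctl and -49°C for each Tccd. That would fit the register reads failing and returning nothing. The Python sketch below mirrors that decoding. The shift, mask and offset are a reading of the upstream source and are assumptions here; the RHEL backport may differ, so check them against your kernel source.

```python
# Constants as understood from upstream k10temp.c (assumptions, not verified
# against the 4.18.0-348 backport).
CUR_TEMP_SHIFT = 21            # Tctl is in the top 11 bits of the register
CUR_TEMP_RANGE_SEL = 1 << 19   # when set, the reading carries a -49 C offset
CCD_TEMP_MASK = 0x7FF          # low 11 bits of each CCD register

def tctl_millideg(regval):
    """Decode the control temperature register to millidegrees C."""
    temp = (regval >> CUR_TEMP_SHIFT) * 125
    if regval & CUR_TEMP_RANGE_SEL:
        temp -= 49000
    return temp

def tccd_millideg(regval):
    """Decode a per-CCD temperature register to millidegrees C."""
    return (regval & CCD_TEMP_MASK) * 125 - 49000

# A register that reads back as all zeros:
print(tctl_millideg(0) / 1000)
print(tccd_millideg(0) / 1000)
```

If that reading of the driver is right, the sensors themselves are probably fine and the driver is simply not getting any data back from the hardware.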

sensors-detect finds Driver `k10temp': * Chip `AMD Family 17h thermal sensors' (confidence: 9) and wants to load k10temp. If I let it, I'm back where I started, with kernel traces and no usable output.

I'm a bit stumped here :(

TrevorH
Site Admin
Posts: 33191
Joined: 2009/09/24 10:40:56
Location: Brighton, UK

Re: Updated k10temp stopped with R7 3700 and kernel 4.18.0-348.2.1.el8_5.x86_64

Post by TrevorH » 2021/11/28 04:48:03

I'd suggest reporting the problems with the distro copy on bugzilla.redhat.com and if they fix that then perhaps the add-on might start working too.


swallowtail
Posts: 112
Joined: 2009/04/18 04:48:27

Re: Updated k10temp stopped with R7 3700 and kernel 4.18.0-348.2.1.el8_5.x86_64

Post by swallowtail » 2021/11/28 06:02:55

Update... between 8.4 and 8.5, CentOS integrated kernel-4.18.0-331.el8 (https://git.centos.org/rpms/kernel/c/eff38ce94f417944a493ce131c5df27f6aec40c8?branch=c8s).

That included changes from RHEL Bugzilla 1980072:

+            - hwmon: (k10temp) Zen3 Ryzen Desktop CPUs support (David Arcari) [1980072]            
+            - hwmon: (k10temp) Remove support for displaying voltage and current on Zen CPUs (David Arcari) [1980072]            
+            - hwmon: (k10temp) Add support for Zen3 CPUs (David Arcari) [1980072]            
+            - hwmon: (k10temp) Take out debugfs code (David Arcari) [1980072]            
+            - hwmon: (k10temp) Define SVI telemetry and current factors for Zen2 CPUs (David Arcari) [1980072]            
+            - hwmon: (k10temp) Create common functions and macros for Zen CPU families (David Arcari) [1980072]            
+            - hwmon: (k10temp) Add AMD family 17h model 60h PCI match (David Arcari) [1980072]            
+            - hwmon: (k10temp) make some symbols static (David Arcari) [1980072]            
+            - hwmon: (k10temp) Reorganize and simplify temperature support detection (David Arcari) [1980072]            
+            - hwmon: (k10temp) Swap Tdie and Tctl on Family 17h CPUs (David Arcari) [1980072]            
+            - hwmon: (k10temp) Display up to eight sets of CCD temperatures (David Arcari) [1980072]            
+            - hwmon: (k10temp) Add debugfs support (David Arcari) [1980072]            
+            - hwmon: (k10temp) Don't show temperature limits on Ryzen (Zen) CPUs (David Arcari) [1980072]            
+            - hwmon: (k10temp) Show core and SoC current and voltages on Ryzen CPUs (David Arcari) [1980072]            
+            - hwmon: (k10temp) Report temperatures per CPU die (David Arcari) [1980072]            
+            - hmon: (k10temp) Convert to use devm_hwmon_device_register_with_info (David Arcari) [1980072]            
+            - hwmon: (k10temp) Use bitops (David Arcari) [1980072]            
+            - hwmon: (k10temp) Add support for AMD family 17h, model 70h CPUs (David Arcari) [1980072]            
+            - treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 56 (David Arcari) [1980072]            
+            - hwmon: (k10temp) Add Hygon Dhyana support (David Arcari) [1980072]            
+            - hwmon: (k10temp) Auto-convert to use SENSOR_DEVICE_ATTR_{RO, RW, WO} (David Arcari) [1980072]            
+            - hwmon: (k10temp) Support all Family 15h Model 6xh and Model 7xh processors (David Arcari) [1980072]            
+            - hwmon: k10temp: Support Threadripper 2920X, 2970WX; simplify offset table (David Arcari) [1980072]            
+            - hwmon: (k10temp) 27C Offset needed for Threadripper2 (David Arcari) [1980072]            
+            - x86/amd_nb: Add AMD family 17h model 60h PCI IDs (David Arcari) [1980072]            
+            - x86/amd_nb: Add PCI device IDs for family 17h, model 70h (David Arcari) [1980072]            
+            - x86/pci, x86/amd_nb: Add Hygon Dhyana support to PCI and northbridge (David Arcari) [1980072]            
+            - Revert "[hwmon] hwmon: (k10temp) Add support for Zen3 CPUs" (David Arcari) [1980072]
I'd put money on those changes being the break.
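
Since the warning in the trace fires in arch/x86/include/asm/amd_nb.h, the x86/amd_nb entries in that list might be worth looking at alongside the k10temp ones. A quick Python sketch for filtering the changelog by keyword follows. The regex is a guess at the entry layout shown above, and a match only narrows down where to look; it proves nothing about the cause.

```python
import re

# "+   - <subject> (<author>) [<bugzilla id>]", with optional trailing spaces
ENTRY = re.compile(
    r"^\+?\s*-\s*(?P<subject>.+?)\s*\((?P<author>[^()]+)\)\s*\[(?P<bz>\d+)\]\s*$"
)

def parse_entry(line):
    """Split one changelog line into subject, author and Bugzilla id."""
    m = ENTRY.match(line)
    return m.groupdict() if m else None

def subjects_mentioning(lines, keyword):
    """Subjects of all parsable entries that contain the keyword."""
    hits = []
    for line in lines:
        entry = parse_entry(line)
        if entry and keyword in entry["subject"]:
            hits.append(entry["subject"])
    return hits

# A few lines copied from the list above.
SAMPLE = [
    "+            - hwmon: (k10temp) Add support for AMD family 17h, model 70h CPUs (David Arcari) [1980072]            ",
    "+            - x86/amd_nb: Add AMD family 17h model 60h PCI IDs (David Arcari) [1980072]            ",
    "+            - x86/amd_nb: Add PCI device IDs for family 17h, model 70h (David Arcari) [1980072]            ",
    '+            - Revert "[hwmon] hwmon: (k10temp) Add support for Zen3 CPUs" (David Arcari) [1980072]',
]

for subject in subjects_mentioning(SAMPLE, "amd_nb"):
    print(subject)
```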
