Journalctl & Obscure Proxmox Errors


I was digging through logs with journalctl --since=yesterday to verify our Proxmox servers were good to go, and I saw plenty of nonsense. When in doubt, I tend to skim for the bright red things that say Error. There were more than zero, and some were a tad concerning.
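
If you only want the red stuff, journalctl can do the skimming for you; a minimal filter (adjust the time window to whatever you're auditing):

# Show only messages at priority "err" or worse since yesterday
journalctl --since=yesterday -p err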

Problem 1

One of our hosts, which is not in a cluster, runs Proxmox Backup Server concurrently, and also runs a VM for VEEAM (for our VMware VMs) and CubeBackup (for our Google Workspace data), was struggling with VMs randomly powering off. There was no obvious reason for this, so I assumed the RAID controller was fighting with ZFS. It had been set to passthrough mode, but nonetheless, this is a common source of headaches.

It turns out it was a bug with the network interfaces! Things run fine for a time, then I start to see the journal flooded with these:

May 17 11:14:28 pve4 kernel: i40e 0000:3d:00.0: Error I40E_AQ_RC_ENOSPC, forcing overflow promiscuous on PF
May 17 11:14:28 pve4 kernel: i40e 0000:3d:00.0: Error I40E_AQ_RC_ENOSPC, forcing overflow promiscuous on PF
May 17 11:14:28 pve4 kernel: i40e 0000:3d:00.0: Error I40E_AQ_RC_ENOSPC, forcing overflow promiscuous on PF

Here is a relevant thread on the Proxmox forums

Ultimately the cause of this is a bug in the Intel NICs used here. They don't cope well with being VLAN aware: with a large number of VLANs the card's hardware filter table overflows (that's the I40E_AQ_RC_ENOSPC), and the driver gives up and forces the port into promiscuous mode.
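
If you want to confirm a port is actually driven by i40e before touching anything, ethtool will tell you (eno1 is just my interface name; substitute your own):

# Print the kernel driver, driver version, and NIC firmware behind this interface
ethtool -i eno1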

The solution

...is easy.

First we edit /etc/network/interfaces

Next we find the stanza for our physical interface, in my case eno1, and we add four lines under it:
iface eno1 inet manual
    offload-rxvlan off
    offload-txvlan off
    offload-tso off
    offload-rx-vlan-filter off

Then, we reboot.

This disables those hardware offloads entirely, so the CPU handles all the VLAN work instead of the NIC. Cool, right??
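
If you'd rather not wait for a reboot, the same offloads can be toggled at runtime with ethtool; this is a sketch using the standard ethtool feature names, which I believe map to the lines above, and it won't survive a reboot on its own:

# Turn off the VLAN offloads, TSO, and the hardware VLAN filter right now
ethtool -K eno1 rxvlan off txvlan off tso off rx-vlan-filter off

# Verify what the NIC is actually doing
ethtool -k eno1 | grep -iE 'vlan|tcp-segmentation'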

Problem 2

Another weird one from the Journal:

May 17 12:10:50 pve3 kernel: kvm [2597464]: vcpu0, guest rIP: 0xfffff8023be21d52 vmx_set_msr: BTF|LBR in IA32_DEBUGCTLMSR 0x1, nop

The log was floooooooooded with these. Basically it means the VM is asking for a CPU debug feature (the BTF and LBR bits in the IA32_DEBUGCTL MSR) that isn't available to it, so KVM quietly ignores the request; that's the "nop" at the end. But that doesn't make any sense, because all our VMs are set to host CPU mode, which means all the flags (capabilities of the CPU) are passed through as-is. Literally impossible for there to be a mismatch... almost impossible.
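
To get a feel for just how flooded, you can scope the journal to kernel messages and count the hits (the pattern is just the distinctive part of the message):

# Count today's occurrences of the warning
journalctl -k --since today | grep -c vmx_set_msr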

The log tells us what process is throwing this error. The PID is 2597464.
Let's run ps and filter down based on that PID to see what we find...

ps -auwx | grep 2597464

This outputs a lot of crap, but a few lines down, we see this:

root 2597464 104 6.4 9482924 8446844 ? Sl 12:27 0:31 /usr/bin/kvm -id 122 -name mastercam...

This means the PID 2597464 is running VM 122, our Mastercam server.
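
Grep works, but once you know the PID you can ask ps for exactly that process and only the columns you care about (same PID as above):

# PID, CPU%, memory%, and the full kvm command line for that one process
ps -p 2597464 -o pid,pcpu,pmem,args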

Let's check the settings.
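
If you prefer the shell to the GUI, qm will dump the same settings; here 122 is the VM ID we just identified, and the grep just narrows it to the lines that matter for this problem:

# Show the VM's CPU, socket/core, and NUMA configuration
qm config 122 | grep -E 'cpu|sockets|cores|numa'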

Well, our CPU is set to host. Do you see the problem?

We have a 2-socket server and a 2-socket VM, but NUMA is not enabled for this VM! Since we have SMP disabled in the BIOS, KVM is expected to take advantage of NUMA and do the work of assigning memory regions and sockets/cores itself.

Check the NUMA box, hit ok. Reboot. Done.
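
The same fix from the CLI, if you'd rather script it (again, VM 122 from above); the VM needs a full stop and start for the change to take effect:

# Enable NUMA for the VM, then restart it cleanly
qm set 122 --numa 1
qm shutdown 122 && qm start 122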

Update: Not done.

As it turns out, no amount of cajoling would solve this problem, as it has to do with the RAID controller. This card, even in JBOD mode, does not behave correctly with ZFS; clearly it is still doing some kind of I/O buffering or caching of its own. There is also no IT-mode firmware readily available for it. Instead of continuing to fight with it, or abandoning ZFS, I decided it was best to just buy a new controller.

LSI 9300-16i 16-port 12Gb/s SAS/SATA HBA, flashed to IT mode (firmware 16.00.12.00)

We picked up this guy, the LSI 9300-16i, on eBay for $59.95, pre-flashed with IT-mode firmware. It has exactly the same I/O as the previous card: 4x high-density Mini SAS connectors (SFF-8643) supporting up to 16 drives. The 6-pin PCIe power connector is optional; we didn't need it. The only hiccup with swapping the cards was that the drive IDs changed, and I ended up having to reinstall Proxmox entirely.
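
One note for anyone hitting the same drive-ID shuffle: ZFS itself doesn't care about /dev/sdX names if you import the pool by stable IDs. A sketch, with a hypothetical pool name (tank):

# Re-import a data pool using persistent by-id device paths instead of /dev/sdX names
zpool import -d /dev/disk/by-id tank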