Fixing Issues after upgrading Proxmox 7 to 8

My initial plan was to update all of my Proxmox nodes to the latest version by the end of this year. While most updates proceeded smoothly, I encountered two errors on one particular node.
Given that updating servers is a critical operation, especially when they are only remotely accessible via the network, I decided to document these errors and their solutions for future reference.

Proxmox host does not come back online after the reboot required by an update

The first issue arose after the mandatory reboot; the server failed to restart. Upon requesting a remote console connection, the boot process stalled with the following error message:

Boot stuck on “A start job is running for Network initialization (XXm / no limit)”

After consulting various posts on the Proxmox forum, I initially suspected a need to update my network configuration. However, my attempts proved unsuccessful, leading to multiple reboots into rescue mode.

Fortunately, I had the insight to consult the official “Proxmox upgrade 7 to 8” guide, where I ultimately discovered the solution to my issue:

Network Setup Hangs on Boot Due to NTPsec Hook

It appears that a bug can lead to both ntpsec and ntpsec-ntpdate being installed at the same time, which in turn makes the network setup hang during boot.

The resolution is to disable the ntpsec-ntpdate start script with chmod -x /etc/network/if-up.d/ntpsec-ntpdate and then reboot – which indeed fixed the issue for me.
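
Spelled out as commands, the fix from the guide boils down to:

# chmod -x /etc/network/if-up.d/ntpsec-ntpdate
# reboot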

A container does not start and only shows an error

The next issue happened with some containers that refused to start up anymore.
The Proxmox UI displayed the following error:

run_buffer: 322 Script exited with status 255  
lxc_init: 844 Failed to run lxc.hook.pre-start for container "105"  
__lxc_start: 2027 Failed to initialize container "105"  
TASK ERROR: startup for container '105' failed

After I started the container manually via the terminal, I got a more specific error:

root /etc/pve/lxc # lxc-start -n 105
lxc-start: 105: ../src/lxc/lxccontainer.c: wait_on_daemonized_start: 870 No such file or directory - Failed to receive the container state
lxc-start: 105: ../src/lxc/tools/lxc_start.c: main: 306 The container failed to start
lxc-start: 105: ../src/lxc/tools/lxc_start.c: main: 309 To get more details, run the container in foreground mode
lxc-start: 105: ../src/lxc/tools/lxc_start.c: main: 311 Additional information can be obtained by setting the --logfile and --logpriority options

root /etc/pve/lxc # lxc-start -n 105 -F
lxc-start: 105: ../src/lxc/conf.c: run_buffer: 322 Script exited with status 255
lxc-start: 105: ../src/lxc/start.c: lxc_init: 844 Failed to run lxc.hook.pre-start for container "105"
lxc-start: 105: ../src/lxc/start.c: __lxc_start: 2027 Failed to initialize container "105"
lxc-start: 105: ../src/lxc/conf.c: run_buffer: 322 Script exited with status 1
lxc-start: 105: ../src/lxc/start.c: lxc_end: 985 Failed to run lxc.hook.post-stop for container "105"
lxc-start: 105: ../src/lxc/tools/lxc_start.c: main: 306 The container failed to start
lxc-start: 105: ../src/lxc/tools/lxc_start.c: main: 311 Additional information can be obtained by setting the --logfile and --logpriority options
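
If the foreground run still does not reveal enough, the options mentioned in the last line can be used to write a debug log – a quick sketch, with an arbitrary log path:

root /etc/pve/lxc # lxc-start -n 105 -F --logfile /tmp/lxc-105.log --logpriority DEBUG
root /etc/pve/lxc # less /tmp/lxc-105.log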

Trying to mount the container disk also produces some more errors:

root /etc/pve/lxc # pct mount 105

mount: /var/lib/lxc/105/rootfs: wrong fs type, bad option, bad superblock on /dev/loop17, missing codepage or helper program, or other error.
       dmesg(1) may have more information after failed mount system call.
mounting container failed
command 'mount -o noacl /dev/loop17 /var/lib/lxc/105/rootfs//' failed: exit code 32

So I initially thought that the filesystem might be corrupt and tried to check it:

root /etc/pve/lxc # pct fsck 105

fsck from util-linux 2.38.1
/var/lib/vz/images/105/vm-105-disk-1.raw: clean, 373288/4194304 files, 8047185/16777216 blocks

As suggested by the mount error, dmesg had some more information:

[  713.133949] loop17: detected capacity change from 0 to 134217728
[  713.137988] ext4: Unknown parameter 'noacl'

The last line provided me with the right clue:
It seems that with Proxmox 8 the handling of the container mount options changed slightly, and ext4 no longer accepts the noacl option. Resetting the disk ACL setting back to its default eventually did the trick.
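
For illustration, the relevant line in the container config (/etc/pve/lxc/105.conf) looked roughly like this – storage name and size are stand-ins for my actual values:

root /etc/pve/lxc # grep rootfs 105.conf
rootfs: local:105/vm-105-disk-1.raw,size=64G,acl=0

The acl=0 flag is what makes Proxmox pass noacl to mount. Setting the ACL option of the root disk back to "Default" in the UI (or removing the acl flag from that line) gets rid of the offending mount option.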


After that, the container was able to start up again.

Disk failure and surprises

Once in a while – especially if you have a system with an uptime > 300 days – hardware tends to fail.

It is a good thing to have a cluster, where you can do the maintenance on one node while the important stuff keeps running on the other ones. It is also good to always have a backup of the whole content in case a disk fails.

One word before I continue, regarding software RAIDs: I once had a big problem with a hardware RAID controller going bonkers and spent a week finding a matching replacement controller to get the data back. So at least for redundant servers I am fine with software RAID (e.g. mdraid). And if you can, you should go with ZFS in any case :-).

Anyhow, if you see graphs like this one:

[Graph: weekly disk statistics for /dev/sda]

You know that something has gone terribly wrong.

A quick check states the obvious:

# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda4[0](F) sdb4[1]
      1822442815 blocks super 1.2 [2/1] [_U]

md2 : active raid1 sda3[0](F) sdb3[1]
      1073740664 blocks super 1.2 [2/1] [_U]

md1 : active raid1 sda2[0](F) sdb2[1]
      524276 blocks super 1.2 [2/1] [_U]

md0 : active raid1 sda1[0](F) sdb1[1]
      33553336 blocks super 1.2 [2/1] [_U]

unused devices: <none>
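
For a more detailed view of a single array – including which member is marked as faulty – mdadm can be asked directly, for example:

# mdadm --detail /dev/md3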

So /dev/sda seems to be gone from the RAID. Let’s do the checks.

Hardware check

hdparm:

# hdparm -I /dev/sda

/dev/sda:
HDIO_DRIVE_CMD(identify) failed: Input/output error 

smartctl:

# smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.32-34-pve] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Vendor:               /0:0:0:0
Product:
User Capacity:        600,332,565,813,390,450 bytes [600 PB]
Logical block size:   774843950 bytes
scsiModePageOffset: response length too short, resp_len=47 offset=50 bd_len=46
>> Terminate command early due to bad response to IEC mode page
A mandatory SMART command failed: exiting. To continue, add one or more '-T permissive' options.

/dev/sda is dead, Jim

The next steps are to schedule a disk replacement, move all services to another host, and prepare the host for its maintenance shutdown.

Preparation for disk replacement

Stop all running Containers:

# for VE in $(vzlist -Ha -o veid); do vzctl stop $VE; done

I also disabled the “start at boot” option so that the Proxmox node itself would come up quickly after the maintenance.
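
A rough sketch of how that can be scripted for all containers at once, reusing the vzlist loop from above (the --onboot flag of vzctl set is the assumption here):

# for VE in $(vzlist -Ha -o veid); do vzctl set $VE --onboot no --save; done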

Next: Remove the faulty disk from the md-RAID:

# mdadm /dev/md0 -r /dev/sda1
# mdadm /dev/md1 -r /dev/sda2
# mdadm /dev/md2 -r /dev/sda3
# mdadm /dev/md3 -r /dev/sda4
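
In my case the kernel had already flagged the partitions as faulty (the (F) markers above). If that had not been the case, they would need to be marked as failed before they can be removed, roughly like this:

# mdadm /dev/md0 -f /dev/sda1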

Shutting down the System.

… someone in the data center walked over to the server at the scheduled time and replaced the faulty disk …

After the replacement, the system came online again and the new disk had to be prepared:

  1. copy the partition table from /dev/sdb to /dev/sda (mind the sgdisk argument order: the target disk is the argument to -R, the source disk comes last)
    # sgdisk -R /dev/sda /dev/sdb
    
  2. recreate the GUID for /dev/sda
    # sgdisk -G /dev/sda
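
Before re-adding the partitions, it does not hurt to verify that both disks now show the same layout, for example:

# sgdisk -p /dev/sda
# sgdisk -p /dev/sdb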
    

Then add /dev/sda to the RAID again.

# mdadm /dev/md0 -a /dev/sda1
# mdadm /dev/md1 -a /dev/sda2
# mdadm /dev/md2 -a /dev/sda3
# mdadm /dev/md3 -a /dev/sda4



# cat /proc/mdstat
Personalities : [raid1]
md3 : active raid1 sda4[2] sdb4[1]
      1822442815 blocks super 1.2 [2/1] [_U]
      [===>.................]  recovery = 16.5% (301676352/1822442815) finish=329.3min speed=76955K/sec

md2 : active raid1 sda3[2] sdb3[1]
      1073740664 blocks super 1.2 [2/2] [UU]

md1 : active raid1 sda2[2] sdb2[1]
      524276 blocks super 1.2 [2/1] [_U]
        resync=DELAYED

md0 : active raid1 sda1[2] sdb1[1]
      33553336 blocks super 1.2 [2/2] [UU]

unused devices: <none>

After nearly 12h, the resync was completed:

[Graph: daily disk IOPS statistics]

and then this happened:

# vzctl start 300
# vzctl enter 300
enter into CT 300 failed
Unable to open pty: No such file or directory

There are plenty of posts if you search for “Unable to open pty: No such file or directory”.

But

# vzctl exec 300 /sbin/MAKEDEV tty
# vzctl exec 300 /sbin/MAKEDEV pty
# vzctl exec 300 mknod --mode=666 /dev/ptmx c 5 2

did not help:

# vzctl enter 300
enter into CT 300 failed
Unable to open pty: No such file or directory    

And

# strace -ff vzctl enter 300

produced a lot of output – system call traces that did not help to solve the problem either.
What finally allowed us to enter the container was mounting devpts inside it:

# vzctl exec 300 mount -t devpts none /dev/pts

But a look at the process list was quite devastating:

# vzctl exec 300 ps -A
  PID TTY          TIME CMD
    1 ?        00:00:00 init
    2 ?        00:00:00 kthreadd/300
    3 ?        00:00:00 khelper/300
  632 ?        00:00:00 ps    

That is not really what you expect to see in the process list of a mail/web server, is it?
After looking around in the system and searching through some configuration files, it became obvious that there had been a system update in the past, but someone had forgotten to install upstart. So that should be easy to fix, right?

# vzctl exec 300 apt-get install upstart
Reading package lists...
Building dependency tree...
Reading state information...
The following packages will be REMOVED:
  sysvinit
The following NEW packages will be installed:
  upstart
WARNING: The following essential packages will be removed.
This should NOT be done unless you know exactly what you are doing!
  sysvinit
0 upgraded, 1 newly installed, 1 to remove and 0 not upgraded.
Need to get 486 kB of archives.
After this operation, 851 kB of additional disk space will be used.
You are about to do something potentially harmful.
To continue type in the phrase 'Yes, do as I say!'
 ?] Yes, do as I say!

BUT:

Err http://ftp.debian.org/debian/ wheezy/main upstart amd64 1.6.1-1
  Could not resolve 'ftp.debian.org'
Failed to fetch http://ftp.debian.org/debian/pool/main/u/upstart/upstart_1.6.1-1_amd64.deb  Could not resolve     'ftp.debian.org'
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?

No Network – doooh.

So… that was that. On to another plan: chroot. Let's start.
First we need to shut down the container – or at least what is left running of it:

# vzctl stop 300
Stopping container ...
Container is unmounted    

Second, we have to bind-mount a couple of pseudo filesystems into the container's root filesystem:

# mount -o bind /dev /var/lib/vz/private/300/dev
# mount -o bind /dev/shm /var/lib/vz/private/300/dev/shm
# mount -o bind /proc /var/lib/vz/private/300/proc
# mount -o bind /sys /var/lib/vz/private/300/sys
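
If something inside the chroot later complains about a missing /dev/pts, that one has to be bind-mounted as well, since a plain bind of /dev does not carry submounts along – for example:

# mount -o bind /dev/pts /var/lib/vz/private/300/dev/pts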

Then perform the chroot and the installation itself:

# chroot /var/lib/vz/private/300 /bin/bash -i

# apt-get install upstart
# exit      

Finally, umount all the things:

# umount -l /var/lib/vz/private/300/sys
# umount -l /var/lib/vz/private/300/proc
# umount -l /var/lib/vz/private/300/dev/shm
# umount -l /var/lib/vz/private/300/dev

If you run into trouble because some of the mounts are busy, kill the processes you find with:

# lsof /var/lib/vz/private/300/dev

Or just look for everything that still references the container:

# lsof 2> /dev/null | egrep '/var/lib/vz/private/300'    
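
One way to turn that list into kill commands is to extract the PID column (field 2 of the default lsof output) and feed it to kill – use with care, this is only a sketch:

# lsof 2> /dev/null | egrep '/var/lib/vz/private/300' | awk '{print $2}' | sort -u | xargs -r kill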

Try to umount again :-).

Now restart the container:

# vzctl start 300
Starting container ...
Container is mounted
Adding IP address(es): 10.10.10.130
Setting CPU units: 1000
Setting CPUs: 2
Container start in progress...

And finally:

 # vzctl enter 300
 root@300 #  
 root@300 # ps -A | wc -l
 142

And this looks a lot better \o/.