How to Fix "Job was active longer than specified deadline" During an Upgrade to Harvester v1.1.2
A known issue but not the most straightforward fix
Recently I decided to upgrade my Harvester cluster from v1.1.1 to v1.1.2. Unfortunately, I ran into the following error: "Job was active longer than specified deadline".
The First Try
After doing some research I found the following issue on GitHub: [BUG] Upgrade stuck in upgrading first node: Job was active longer than specified deadline #2894
Browsing through the issue, I found the following solution:
Increase the job deadline before upgrading:
$ cat > /tmp/fix.yaml <<EOF
spec:
  values:
    systemUpgradeJobActiveDeadlineSeconds: "3600"
EOF
$ kubectl patch managedcharts.management.cattle.io local-managed-system-upgrade-controller --namespace fleet-local --patch-file=/tmp/fix.yaml --type merge
$ kubectl -n cattle-system rollout restart deploy/system-upgrade-controller
Delete the upgrade and start over again (this might take several attempts). Preload time will be reduced because some images are already present on the nodes.
Manually preload images from ISO on each node first.
~ github.com/harvester/harvester/issues/2894
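Before retrying, it is worth confirming that the new deadline actually landed on the managed chart. Here is a quick read-back check; the jsonpath expression is just one way to do it:
$ kubectl get managedcharts.management.cattle.io local-managed-system-upgrade-controller --namespace fleet-local -o jsonpath='{.spec.values.systemUpgradeJobActiveDeadlineSeconds}'
It should print 3600 if the patch merged.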
I tested this out. First, I SSHed into the master/parent node and ran the commands above. I restarted the upgrade process and waited. An hour later I checked, and it had failed again. It seems the error is linked either to slow HDD speeds preventing the image from being written to the nodes before the timeout, or to a timeout during the initial pull of the image.
The Solution
I decided to dig deeper. After more research, I opted to spin up an HTTP server to serve the ISO files and the version.yaml file for the upgrade. This is called an air-gapped upgrade. You can see the bash script I used here.
The Apache web server serves the files from the /iso directory.
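If you don't already have a web server available, the stock Apache httpd container is enough. This is only a sketch; the host path /iso, the container name, and the port mapping are assumptions to adjust for your environment:
$ docker run -d --name iso-server -p 80:80 -v /iso:/usr/local/apache2/htdocs/iso:ro httpd:2.4
Any static file server that the Harvester nodes can reach over HTTP will do.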
Since the script downloads the Harvester images (it needs to be updated to reflect the new v1.1.2 version), the only thing left to add is the version.yaml file. The documentation says:
- Download the version file from release pages, for example, https://releases.rancher.com/harvester/{version}/version.yaml
~ docs.harvesterhci.io/dev/upgrade/automatic
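In practice, an easy starting point is to grab the upstream file and then edit it to point at the local server. Assuming the {version} placeholder expands to v1.1.2, something like:
$ curl -fLO https://releases.rancher.com/harvester/v1.1.2/version.yaml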
My version.yaml file looks like this:
apiVersion: harvesterhci.io/v1beta1
kind: Version
metadata:
  name: v1.1.2
  namespace: harvester-system
spec:
  isoChecksum: '53e28927d83f2e02387e14e56210cc1c6f9f8b89bc702785e59ca75fbc97238949c63f0264d214ad683bbfd482f1c697933f6381fcfd05cceb48167a0c7cdba5'
  isoURL: http://<apache-web-server-ip>/iso/<name-of-iso>
  releaseDate: '20230425'
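The isoChecksum is simply the checksum of the ISO you are serving; judging by its length (128 hex characters) it is SHA-512, so you can regenerate it on the web server with something along these lines (the placeholder is whatever filename your script downloaded):
$ sha512sum /iso/<name-of-iso>
Paste the resulting hash into the isoChecksum field.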
Now that we have everything ready to go, let’s remove the current failed upgrade:
kubectl get upgrade.harvesterhci.io -n harvester-system -l harvesterhci.io/latestUpgrade=true
kubectl delete upgrade.harvesterhci.io/<name-of-upgrade> -n harvester-system
Now we create the new upgrade:
kubectl create -f http://<apache-web-server-ip>/iso/version.yaml
An upgrade button should show up on the Harvester GUI Dashboard page. Now you should be able to upgrade without a problem.
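If you prefer to follow along from the CLI instead of the dashboard, watching the same resource we queried earlier works too (the -w flag streams status changes as they happen):
$ kubectl get upgrade.harvesterhci.io -n harvester-system -w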
As you can see, the upgrade is now progressing.
Now that the master/parent node has been upgraded along with the system services, the worker/child nodes can be upgraded.
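To confirm each node has picked up the new version as the rollout moves along, a plain node listing is enough; the OS image and kubelet version columns should update node by node:
$ kubectl get nodes -o wide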
We can now consider it a successful upgrade!
Cheers,
Joe