getting stuck during harvester 1.8.0 upgrade
another day, another stuck harvester upgrade
I tried to upgrade my two-node Harvester home lab from v1.7.1 to v1.8.0. It got stuck four times, each in a different way. This post collects the diagnoses and fixes end to end so the next person (probably me) can recover faster. And yes, it is almost midnight; I just wrapped up.
Two earlier posts in this series cover individual stall modes that show up here too:
- Harvester Upgrade Stuck at Pre-Drained State
- Harvester Upgrade Stuck at Post-Draining
If you’re hitting one specific stall, jump to the matching section. If the upgrade has just refused to start at all, read in order — the 1.7.1 → 1.8.0 path tripped on each phase in sequence for me.
Stage 1 — Webhook refuses the upgrade: “managed chart harvester is not ready”
Click Upgrade in the UI, immediately get:
admission webhook "validator.harvesterhci.io" denied the request:
managed chart harvester is not ready, please wait for it to be ready or fix it

The webhook is reading ManagedChart/harvester in fleet-local. If it isn’t Ready, no upgrade. Check the state:
kubectl describe managedchart harvester -n fleet-local

In my case the conditions said:
ErrApplied(1) [Cluster fleet-local/local: release has no 13 version]

That message comes from Helm’s storage driver — Fleet asked for a specific Helm revision that no longer exists. Look at the actual Helm release secrets:
kubectl get secret -n harvester-system \
-l owner=helm,name=harvester \
--sort-by=.metadata.creationTimestamp \
  -o custom-columns=NAME:.metadata.name,VERSION:.metadata.labels.version,STATUS:.metadata.labels.status

Mine looked like this — note the gap from v9 to v14 and the stuck pending-upgrade:
NAME                               VERSION   STATUS
sh.helm.release.v1.harvester.v9    9         deployed
sh.helm.release.v1.harvester.v14   14        pending-upgrade

A prior failed upgrade left a pending-upgrade lock secret around. While that exists, Helm refuses any further operations ("another operation in progress"). Fleet was also internally tracking a different revision number (13) that never made it to a real secret.
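If you have the helm CLI pointed at the cluster (it isn't on the nodes by default, so treat this as an optional cross-check), helm history reads the same release secrets and will show the lock directly:

helm history harvester -n harvester-system
# the newest revision shows STATUS pending-upgrade while the lock secret exists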
Fix
Back up the stuck secret, delete it, then bounce the local fleet-agent so it drops its cached “rollback to 13” target and reconciles fresh:
# 1. Backup
kubectl get secret sh.helm.release.v1.harvester.v14 \
-n harvester-system -o yaml \
> ~/Downloads/harvester-helm-v14-backup.yaml
# 2. Delete the stuck pending-upgrade secret
kubectl delete secret sh.helm.release.v1.harvester.v14 -n harvester-system
# 3. Restart fleet-agent in the local cluster's agent namespace
kubectl rollout restart deployment fleet-agent -n cattle-fleet-local-system

After the new fleet-agent takes the lease, it does a real helm upgrade and creates v10 — not v13. Watch the secrets transition:
kubectl get secret -n harvester-system -l owner=helm,name=harvester \
  -o custom-columns=NAME:.metadata.name,VERSION:.metadata.labels.version,STATUS:.metadata.labels.status

You want to see v10 deployed and v9 superseded. The webhook will accept the upgrade once the ManagedChart shows readyClusters: 1/1.
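A rough way to confirm the ManagedChart itself is happy again before retrying the upgrade (the exact status field names vary a little between Fleet versions, so this is just a grep, not a precise query):

kubectl get managedchart harvester -n fleet-local -o yaml | grep -iE 'ready|err'
# look for Ready conditions set to True and no lingering ErrApplied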
Stage 2 — CDI importer pending: “scratch space required and none found”
The upgrade kicks off, starts pulling the upgrade ISO, then sits forever at Download Upgrade Image. The events on the upgrade PVC look like:
ExternalProvisioning PersistentVolumeClaim hvst-upgrade-... Waiting for a volume...
Provisioning External provisioner is provisioning volume for claim ...
Provisioning Assuming an external populator will provision the volume

The PVC has a CDI populator dataSource. CDI spins up an importer-prime-... pod, but it’s Pending with:
FailedScheduling: persistentvolumeclaim "prime-...-scratch" not found

And the parent PVC carries:
cdi.kubevirt.io/storage.condition.running.message: scratch space required and none found
cdi.kubevirt.io/storage.condition.running.reason: Scratch space required

CDI uses a scratch PVC alongside the destination PVC for the import. If scratchSpaceStorageClass isn’t set, CDI fails closed instead of falling back to your default storage class. The CDIConfig object is reconciled from the parent CDI CR — patching CDIConfig directly gets reverted; you have to patch CDI.
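You can confirm the field is actually empty before patching (a quick jsonpath sketch; an empty line back means no scratch storage class is configured anywhere):

kubectl get cdi cdi -o jsonpath='{.spec.config.scratchSpaceStorageClass}{"\n"}'
kubectl get cdiconfig config -o jsonpath='{.status.scratchSpaceStorageClass}{"\n"}'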
Fix
# Patch the CDI CR (NOT cdiconfig)
kubectl patch cdi cdi --type=merge \
-p '{"spec":{"config":{"scratchSpaceStorageClass":"harvester-longhorn"}}}'
# Verify it propagated
kubectl get cdiconfig config -o jsonpath='{.spec}{"\n"}{.status.scratchSpaceStorageClass}{"\n"}'
# Kick the stuck importer pod so CDI re-templates it with scratch volume
kubectl delete pod -n harvester-system <importer-prime-...>

Within seconds you should see a prime-<uuid>-scratch PVC bind and the importer pod transition to Running. Image preload phases on each node will follow.
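If you would rather watch it recover than refresh the UI, something like this works (a sketch; the names match what I saw on my cluster):

kubectl get pvc -n harvester-system -w
# wait for prime-<uuid>-scratch to show Bound, then check the importer pod:
kubectl get pods -n harvester-system | grep importer-prime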
Stage 3 — Pre-drained, but never advances
This is the same stall I wrote about previously — see Harvester Upgrade Stuck at Pre-Drained State — and it still bites on the 1.7.1 → 1.8.0 path.
The upgrade UI sits at Upgrading Node 0% with the target node showing Pre-drained. The Harvester controller keeps logging:
node <name> is in upgrade pre-draining state, try to detach unused volumes

It will loop on that line indefinitely. The blocker is a Longhorn PodDisruptionBudget protecting the instance-manager pod on the node — Kubernetes can’t drain past it.
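You can see the offending budgets directly; Longhorn names each PDB after the instance-manager pod it protects, which is why the fix below deletes PDBs by pod name:

kubectl get poddisruptionbudget -n longhorn-system | grep instance-manager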
Fix
export TARGET_NODE="name-of-node-that-is-stuck"
kubectl get pods \
--namespace longhorn-system \
--field-selector spec.nodeName=${TARGET_NODE} \
-o custom-columns=":metadata.name" \
--no-headers \
| grep "instance-manager" \
| while read pod; do
kubectl delete poddisruptionbudget $pod -n longhorn-system
done

Repeat per node. The drain proceeds, the node reboots into the new OS, and the state machine moves to Post-draining.
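From another terminal you can watch the node go through the drain and reboot (purely an observability sketch, reusing the TARGET_NODE variable from above):

kubectl get node ${TARGET_NODE} -w
# expect SchedulingDisabled, then NotReady during the reboot, then Ready on the new OS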
Stage 4 — Post-draining, upgrade-repo can’t reach 2 replicas
Same root cause as Harvester Upgrade Stuck at Post-Draining. The post-drain job tries to download harvester-iso/bundle/metadata.yaml from the in-cluster upgrade-repo service. The deployment is 1/2 ready and the request times out.
kubectl -n harvester-system get deployment -l harvesterhci.io/upgradeComponent=repo
# upgrade-repo-hvst-upgrade-... 1/2 2 1
kubectl -n harvester-system logs job/hvst-upgrade-...-post-drain-<node>
# curl: (18) end of response with 901 bytes missing

The upgrade ISO PVC was provisioned with numberOfReplicas: 1. That single replica landed on the node we just drained, so the volume goes Faulted/Detached the moment the node reboots. The repo pod on the surviving node has nothing to read from.
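To see the volume state from kubectl rather than the Longhorn UI (a sketch; the Longhorn Volume CR name comes from the PVC's bound PV):

# find the PV/volume backing the upgrade PVC
kubectl -n harvester-system get pvc | grep hvst-upgrade
# inspect the Longhorn volume CRs; state/robustness should show detached or faulted while the node is down
kubectl -n longhorn-system get volumes.longhorn.io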
Fix
The fix has two parts: scale the repo deployment down so the surviving replica isn’t blocked waiting for a second pod, and uncordon the rebooting node so its instance-manager can come back and re-attach the volume.
# Scale upgrade-repo down to 1 (the original blog fix)
kubectl -n harvester-system scale deployment \
upgrade-repo-hvst-upgrade-<id> --replicas=1
# Uncordon the post-drained node so longhorn can recover the volume
kubectl uncordon <node-name>

I also bumped the upgrade ISO volume to numberOfReplicas: 2 in the Longhorn UI so a future drain doesn’t strand it. Once Longhorn reports the volume attached / healthy, the post-drain job’s curl succeeds, the node finishes its Waiting Reboot step, and the upgrade marches to completion.
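I did the replica bump in the Longhorn UI; if you prefer kubectl, a rough equivalent is to patch the Longhorn Volume CR directly (the volume name comes from the PVC's spec.volumeName, so treat this as a sketch):

kubectl -n longhorn-system patch volumes.longhorn.io <volume-name> \
  --type=merge -p '{"spec":{"numberOfReplicas":2}}'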
What I’d do differently next time
- Before clicking Upgrade, check kubectl describe managedchart harvester -n fleet-local and the Helm release secrets. If anything is pending-upgrade or has gaps in revision numbers, fix it first.
- Make sure kubectl get cdi cdi -o yaml has spec.config.scratchSpaceStorageClass set. The default storage class is not a fallback for CDI scratch.
- Pre-bump important Longhorn volumes (especially the upgrade ISO once it lands) to numberOfReplicas: 2 on a multi-node cluster; single-replica volumes turn drains into outages. A pre-flight snippet that bundles these checks follows below.
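Here is that rough pre-flight snippet (a sketch; adjust namespaces and names if your cluster differs):

# 1. ManagedChart and Helm release sanity
kubectl describe managedchart harvester -n fleet-local | grep -iE 'ready|err'
kubectl get secret -n harvester-system -l owner=helm,name=harvester \
  -o custom-columns=NAME:.metadata.name,VERSION:.metadata.labels.version,STATUS:.metadata.labels.status
# 2. CDI scratch space storage class must be set
kubectl get cdi cdi -o jsonpath='{.spec.config.scratchSpaceStorageClass}{"\n"}'
# 3. Replica counts of volumes you care about (watch for 1)
kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.numberOfReplicas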
Same general pattern as the previous two posts: Harvester upgrades are a state machine, and most stalls are some upstream component (Helm, CDI, Longhorn) refusing to make a destructive call without a human nudging it.
Is it just me? Does anyone else always have upgrade issues?
Cheers,
Joe