getting stuck during harvester 1.8.0 upgrade
another day, another stuck harvester upgrade
I tried to upgrade my two-node Harvester home lab from v1.7.1 to v1.8.0. It got stuck four times, each in a different way. This post collects the diagnoses and fixes end to end so the next person (probably me) can recover faster. And yes, it is almost midnight; I just wrapped up.
Two earlier posts in this series cover individual stall modes that show up here too:
- Harvester Upgrade Stuck at Pre-Drained State
- Harvester Upgrade Stuck at Post-Draining
If you’re hitting one specific stall, jump to the matching section. If the upgrade has just refused to start at all, read in order — the 1.7.1 → 1.8.0 path tripped on each phase in sequence for me.
Stage 1 — Webhook refuses the upgrade: “managed chart harvester is not ready”
Click Upgrade in the UI, immediately get:
admission webhook "validator.harvesterhci.io" denied the request:
managed chart harvester is not ready, please wait for it to be ready or fix it

The webhook is reading ManagedChart/harvester in fleet-local. If it isn’t Ready, no upgrade. Check the state:
kubectl describe managedchart harvester -n fleet-local

In my case the conditions said:
ErrApplied(1) [Cluster fleet-local/local: release has no 13 version]

That message comes from Helm’s storage driver — Fleet asked for a specific Helm revision that no longer exists. Look at the actual Helm release secrets:
kubectl get secret -n harvester-system \
-l owner=helm,name=harvester \
--sort-by=.metadata.creationTimestamp \
  -o custom-columns=NAME:.metadata.name,VERSION:.metadata.labels.version,STATUS:.metadata.labels.status

Mine looked like this — note the gap from v9 to v14 and the stuck pending-upgrade:
NAME                               VERSION   STATUS
sh.helm.release.v1.harvester.v9    9         deployed
sh.helm.release.v1.harvester.v14   14        pending-upgrade

A prior failed upgrade left a pending-upgrade lock secret around. While that exists, Helm refuses any further operations ("another operation in progress"). Fleet was also internally tracking a different revision number (13) that never made it to a real secret.
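If you have the helm CLI pointed at the cluster (it isn't on the nodes by default, so treat this as an optional cross-check), helm history reads the same release secrets and will show the lock directly:

helm history harvester -n harvester-system
# the newest revision shows STATUS pending-upgrade while the lock secret exists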
Fix
Back up the stuck secret, delete it, then bounce the local fleet-agent so it drops its cached “rollback to 13” target and reconciles fresh:
# 1. Backup
kubectl get secret sh.helm.release.v1.harvester.v14 \
-n harvester-system -o yaml \
> ~/Downloads/harvester-helm-v14-backup.yaml
# 2. Delete the stuck pending-upgrade secret
kubectl delete secret sh.helm.release.v1.harvester.v14 -n harvester-system
# 3. Restart fleet-agent in the local cluster's agent namespace
kubectl rollout restart deployment fleet-agent -n cattle-fleet-local-system

After the new fleet-agent takes the lease, it does a real helm upgrade and creates v10 — not v13. Watch the secrets transition:
kubectl get secret -n harvester-system -l owner=helm,name=harvester \
  -o custom-columns=NAME:.metadata.name,VERSION:.metadata.labels.version,STATUS:.metadata.labels.status

You want to see v10 deployed and v9 superseded. The webhook will accept the upgrade once the ManagedChart shows readyClusters: 1/1.
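A rough way to confirm the ManagedChart itself is happy again before retrying the upgrade (the exact status field names vary a little between Fleet versions, so this is just a grep, not a precise query):

kubectl get managedchart harvester -n fleet-local -o yaml | grep -iE 'ready|err'
# look for Ready conditions set to True and no lingering ErrApplied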
Stage 2 — CDI importer pending: “scratch space required and none found”
The upgrade kicks off, starts pulling the upgrade ISO, then sits forever at Download Upgrade Image. The events on the upgrade PVC look like:
ExternalProvisioning PersistentVolumeClaim hvst-upgrade-... Waiting for a volume...
Provisioning External provisioner is provisioning volume for claim ...
Provisioning Assuming an external populator will provision the volume

The PVC has a CDI populator dataSource. CDI spins up an importer-prime-... pod, but it’s Pending with:
FailedScheduling: persistentvolumeclaim "prime-...-scratch" not found

And the parent PVC carries:
cdi.kubevirt.io/storage.condition.running.message: scratch space required and none found
cdi.kubevirt.io/storage.condition.running.reason: Scratch space required

CDI uses a scratch PVC alongside the destination PVC for the import. If scratchSpaceStorageClass isn’t set, CDI fails closed instead of falling back to your default storage class. The CDIConfig object is reconciled from the parent CDI CR — patching CDIConfig directly gets reverted; you have to patch CDI.
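You can confirm the field is actually empty before patching (a quick jsonpath sketch; an empty line back means no scratch storage class is configured anywhere):

kubectl get cdi cdi -o jsonpath='{.spec.config.scratchSpaceStorageClass}{"\n"}'
kubectl get cdiconfig config -o jsonpath='{.status.scratchSpaceStorageClass}{"\n"}'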
Fix
# Patch the CDI CR (NOT cdiconfig)
kubectl patch cdi cdi --type=merge \
-p '{"spec":{"config":{"scratchSpaceStorageClass":"harvester-longhorn"}}}'
# Verify it propagated
kubectl get cdiconfig config -o jsonpath='{.spec}{"\n"}{.status.scratchSpaceStorageClass}{"\n"}'
# Kick the stuck importer pod so CDI re-templates it with scratch volume
kubectl delete pod -n harvester-system <importer-prime-...>

Within seconds you should see a prime-<uuid>-scratch PVC bind and the importer pod transition to Running. Image preload phases on each node will follow.
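If you would rather watch it recover than refresh the UI, something like this works (a sketch; the names match what I saw on my cluster):

kubectl get pvc -n harvester-system -w
# wait for prime-<uuid>-scratch to show Bound, then check the importer pod:
kubectl get pods -n harvester-system | grep importer-prime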
Stage 3 — Pre-drained, but never advances
This is the same stall I wrote about previously — see Harvester Upgrade Stuck at Pre-Drained State — and it still bites on the 1.7.1 → 1.8.0 path.
The upgrade UI sits at Upgrading Node 0% with the target node showing Pre-drained. The Harvester controller keeps logging:
node <name> is in upgrade pre-draining state, try to detach unused volumes

It will loop on that line indefinitely. The blocker is a Longhorn PodDisruptionBudget protecting the instance-manager pod on the node — Kubernetes can’t drain past it.
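You can see the offending budgets directly; Longhorn names each PDB after the instance-manager pod it protects, which is why the fix below deletes PDBs by pod name:

kubectl get poddisruptionbudget -n longhorn-system | grep instance-manager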
Fix
export TARGET_NODE="name-of-node-that-is-stuck"
kubectl get pods \
--namespace longhorn-system \
--field-selector spec.nodeName=${TARGET_NODE} \
-o custom-columns=":metadata.name" \
--no-headers \
| grep "instance-manager" \
| while read pod; do
kubectl delete poddisruptionbudget $pod -n longhorn-system
done

Repeat per node. The drain proceeds, the node reboots into the new OS, and the state machine moves to Post-draining.
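From another terminal you can watch the node go through the drain and reboot (purely an observability sketch, reusing the TARGET_NODE variable from above):

kubectl get node ${TARGET_NODE} -w
# expect SchedulingDisabled, then NotReady during the reboot, then Ready on the new OS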
Stage 4 — Post-draining, upgrade-repo can’t reach 2 replicas
Same root cause as Harvester Upgrade Stuck at Post-Draining. The post-drain job tries to download harvester-iso/bundle/metadata.yaml from the in-cluster upgrade-repo service. The deployment is 1/2 ready and the request times out.
kubectl -n harvester-system get deployment -l harvesterhci.io/upgradeComponent=repo
# upgrade-repo-hvst-upgrade-... 1/2 2 1
kubectl -n harvester-system logs job/hvst-upgrade-...-post-drain-<node>
# curl: (18) end of response with 901 bytes missing

The upgrade ISO PVC was provisioned with numberOfReplicas: 1. That single replica landed on the node we just drained, so the volume goes Faulted/Detached the moment the node reboots. The repo pod on the surviving node has nothing to read from.
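To see the volume state from kubectl rather than the Longhorn UI (a sketch; the Longhorn Volume CR name comes from the PVC's bound PV):

# find the PV/volume backing the upgrade PVC
kubectl -n harvester-system get pvc | grep hvst-upgrade
# inspect the Longhorn volume CRs; state/robustness should show detached or faulted while the node is down
kubectl -n longhorn-system get volumes.longhorn.io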
Fix
The fix has two parts: scale the repo deployment down so the surviving replica isn’t blocked waiting for a second pod, and uncordon the rebooting node so its instance-manager can come back and re-attach the volume.
# Scale upgrade-repo down to 1 (the original blog fix)
kubectl -n harvester-system scale deployment \
upgrade-repo-hvst-upgrade-<id> --replicas=1
# Uncordon the post-drained node so longhorn can recover the volume
kubectl uncordon <node-name>

I also bumped the upgrade ISO volume to numberOfReplicas: 2 in the Longhorn UI so a future drain doesn’t strand it. Once Longhorn reports the volume attached / healthy, the post-drain job’s curl succeeds, the node finishes its Waiting Reboot step, and the upgrade marches to completion.
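I did the replica bump in the Longhorn UI; if you prefer kubectl, a rough equivalent is to patch the Longhorn Volume CR directly (the volume name comes from the PVC's spec.volumeName, so treat this as a sketch):

kubectl -n longhorn-system patch volumes.longhorn.io <volume-name> \
  --type=merge -p '{"spec":{"numberOfReplicas":2}}'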
What I’d do differently next time
- Before clicking Upgrade, check kubectl describe managedchart harvester -n fleet-local and the Helm release secrets. If anything is pending-upgrade or has gaps in revision numbers, fix it first.
- Make sure kubectl get cdi cdi -o yaml has spec.config.scratchSpaceStorageClass set. The default storage class is not a fallback for CDI scratch.
- Pre-bump important Longhorn volumes (especially the upgrade ISO once it lands) to numberOfReplicas: 2 on a multi-node cluster; single-replica volumes turn drains into outages. A pre-flight snippet that bundles these checks follows below.
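Here is that rough pre-flight snippet (a sketch; adjust namespaces and names if your cluster differs):

# 1. ManagedChart and Helm release sanity
kubectl describe managedchart harvester -n fleet-local | grep -iE 'ready|err'
kubectl get secret -n harvester-system -l owner=helm,name=harvester \
  -o custom-columns=NAME:.metadata.name,VERSION:.metadata.labels.version,STATUS:.metadata.labels.status
# 2. CDI scratch space storage class must be set
kubectl get cdi cdi -o jsonpath='{.spec.config.scratchSpaceStorageClass}{"\n"}'
# 3. Replica counts of volumes you care about (watch for 1)
kubectl -n longhorn-system get volumes.longhorn.io \
  -o custom-columns=NAME:.metadata.name,REPLICAS:.spec.numberOfReplicas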
Same general pattern as the previous two posts: Harvester upgrades are a state machine, and most stalls are some upstream component (Helm, CDI, Longhorn) refusing to make a destructive call without a human nudging it.
Is it just me? Does anyone else always have upgrade issues?
Cheers,
Joe