One-time MongoDB Repair Guide
This guide describes how to run a one-shot mongod --repair against the data
volume of the MongoDB instance that is deployed as a subchart of Smartfacts.
In all commands and manifests below, replace <release> with the actual
Helm release name and <namespace> with the namespace of the Smartfacts
deployment.
Scope: Standalone MongoDB only. This guide assumes the MongoDB subchart
is deployed with architecture: standalone (the Smartfacts chart default;
see Reference: chart defaults below). It does not apply to deployments
that run MongoDB as a replica set (architecture: replicaset, including
single-node replica sets with replicaCount: 1). On a replica set, the
recommended recovery is to resync the affected member from a healthy
primary, not mongod --repair.
Procedure
The repair is performed by a one-shot Kubernetes Job that mounts the same
PVC as the MongoDB StatefulSet, using the same image and security context.
The Job is independent of the Helm release and can be removed without trace
once the repair has finished.
1. Stop everything that writes to MongoDB
Scale all Smartfacts services that talk to MongoDB down to zero so that no
client writes during or after the repair:
kubectl -n <namespace> scale deploy --all --replicas=0
If any deployment in the namespace runs with more than one replica (e.g.
for high availability), record its replica count beforehand with
kubectl -n <namespace> get deploy so that you can restore it in step 6.
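Recording the replica counts by hand is error-prone when the namespace holds many deployments. The following is a minimal sketch, not part of the Smartfacts chart, that saves the counts to a file before scaling down and restores them from that file in step 6. The function names and the file path are hypothetical; only standard kubectl get and kubectl scale calls are used.

```shell
# Sketch: save and restore Deployment replica counts around the repair.
# Function names and the file location are assumptions made for this example.

# Write one "<deployment><TAB><replicas>" line per Deployment in the namespace.
# $1 = namespace, $2 = output file
save_replicas() {
  kubectl -n "$1" get deploy \
    -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.replicas}{"\n"}{end}' \
    > "$2"
}

# Scale every recorded Deployment back to its saved replica count.
# $1 = namespace, $2 = file written by save_replicas
restore_replicas() {
  tab=$(printf '\t')
  while IFS="$tab" read -r name replicas; do
    # Skip empty lines so that a trailing newline does not trigger a call.
    [ -n "$name" ] || continue
    kubectl -n "$1" scale deploy "$name" --replicas="$replicas" < /dev/null
  done < "$2"
}
```

Run save_replicas <namespace> replicas.tsv before the scale-down command above, keep the file, and call restore_replicas <namespace> replicas.tsv once the validation in step 6 has passed.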
2. Take a backup of the PVC
mongod --repair may discard data that it considers unrecoverable. Always
take a backup first, e.g. via a VolumeSnapshot (preferred, if your storage
class supports it) or by attaching a helper Pod to the PVC and streaming a
tar archive to safe storage.
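For the VolumeSnapshot route, the sketch below prints a snapshot manifest for the MongoDB data PVC. It assumes that the CSI snapshot CRDs (snapshot.storage.k8s.io/v1) are installed in the cluster and that you know the name of a suitable VolumeSnapshotClass. The function name and the snapshot name <release>-mongodb-pre-repair are hypothetical choices for this example.

```shell
# Sketch: print a VolumeSnapshot manifest for the MongoDB data PVC.
# $1 = Helm release name
# $2 = VolumeSnapshotClass name (cluster-specific; an assumption of this example)
snapshot_manifest() {
  cat <<EOF
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: $1-mongodb-pre-repair
spec:
  volumeSnapshotClassName: $2
  source:
    persistentVolumeClaimName: datadir-$1-mongodb-0
EOF
}
```

A possible invocation is snapshot_manifest <release> <snapshot-class> | kubectl -n <namespace> apply -f -. Before continuing, check that the snapshot reports status.readyToUse: true, for example with kubectl -n <namespace> get volumesnapshot <release>-mongodb-pre-repair -o yaml.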
3. Scale the MongoDB StatefulSet to zero
The PVC is ReadWriteOnce, so the repair Job can only attach once the
mongod Pod has released the volume:
kubectl -n <namespace> scale statefulset <release>-mongodb --replicas=0
kubectl -n <namespace> wait --for=delete pod/<release>-mongodb-0 --timeout=120s
4. Run the repair Job
Apply the following manifest. It uses the same image, the same
runAsUser/fsGroup, and the same GLIBC_TUNABLES environment variable as
the StatefulSet, so the repair runs against the data files with a
binary-compatible mongod:
apiVersion: batch/v1
kind: Job
metadata:
  name: mongodb-repair
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      securityContext:
        fsGroup: 1001
      containers:
        - name: mongod-repair
          # Attention: if you overrode the MongoDB image in your values file, use that image here instead.
          image: repo.mid.de/mongodb:5.0-MID1.1.0
          securityContext:
            runAsUser: 1001
            runAsNonRoot: true
          env:
            - name: GLIBC_TUNABLES
              value: "glibc.pthread.rseq=0"
          command: ["mongod"]
          args:
            - "--repair"
            - "--dbpath=/bitnami/mongodb/data/db"
          volumeMounts:
            - name: datadir
              mountPath: /bitnami/mongodb
      volumes:
        - name: datadir
          persistentVolumeClaim:
            claimName: datadir-<release>-mongodb-0
ATTENTION: If you overrode the MongoDB image version in your values file, make sure that the Job uses the same image version (replace the tag in the image: property of the Job specification).
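To confirm which image and security settings the running deployment actually uses, you can read them from the StatefulSet, which still exists after it has been scaled to zero. The sketch below is one way to do this. It assumes that mongod is the first container in the Pod template (containers[0]); the function name is hypothetical.

```shell
# Sketch: print image, runAsUser and fsGroup of the MongoDB StatefulSet so
# that the Job manifest can be aligned with them.
# $1 = namespace, $2 = Helm release name
# Assumption: the mongod container is the first container in the Pod template.
statefulset_settings() {
  tpl='image={.spec.template.spec.containers[0].image}{"\n"}'
  tpl="$tpl"'runAsUser={.spec.template.spec.containers[0].securityContext.runAsUser}{"\n"}'
  tpl="$tpl"'fsGroup={.spec.template.spec.securityContext.fsGroup}{"\n"}'
  kubectl -n "$1" get statefulset "$2-mongodb" -o jsonpath="$tpl"
}
```

Call it as statefulset_settings <namespace> <release> and compare the three printed values with image, runAsUser and fsGroup in the Job manifest before applying it.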
To apply the manifest, save it to a file named mongodb-repair-job.yaml, replace the placeholder <release> with your release name, and apply it in the Smartfacts namespace:
kubectl apply -f mongodb-repair-job.yaml -n <namespace>
Watch the logs and wait for the Job to complete successfully:
kubectl -n <namespace> logs -f job/mongodb-repair
kubectl -n <namespace> wait --for=condition=complete job/mongodb-repair --timeout=2h
The repair is finished when mongod exits with code 0 and the log shows
something like Finished checking dbs. If the Job ends with Failed, do not bring
the database back up; investigate the logs and restore from the backup
taken in step 2.
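Note that kubectl wait --for=condition=complete only watches for the Complete condition, so a failed Job is reported only when the timeout expires. The following sketch polls the Job's status counters instead and returns as soon as the Job has either succeeded or failed. The function names and the poll interval are assumptions of this example, and the loop has no upper time limit of its own.

```shell
# Sketch: classify a Job as "complete", "failed" or "running" from its
# .status.succeeded and .status.failed counters.
# $1 = namespace, $2 = Job name
job_outcome() {
  succeeded=$(kubectl -n "$1" get job "$2" -o jsonpath='{.status.succeeded}')
  failed=$(kubectl -n "$1" get job "$2" -o jsonpath='{.status.failed}')
  # Missing counters are printed as empty strings; treat them as 0.
  if [ "${succeeded:-0}" -ge 1 ]; then
    echo complete
  elif [ "${failed:-0}" -ge 1 ]; then
    echo failed
  else
    echo running
  fi
}

# Poll until the Job leaves the running state.
# Returns 0 on success and 1 on failure.
# $1 = namespace, $2 = Job name, $3 = poll interval in seconds (default 10)
wait_for_job() {
  while :; do
    case "$(job_outcome "$1" "$2")" in
      complete) return 0 ;;
      failed)   return 1 ;;
    esac
    sleep "${3:-10}"
  done
}
```

For example, wait_for_job <namespace> mongodb-repair || echo "repair failed, keep MongoDB down" makes a failure visible immediately instead of after the two-hour timeout.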
5. Clean up and restart MongoDB
kubectl -n <namespace> delete job mongodb-repair
kubectl -n <namespace> scale statefulset <release>-mongodb --replicas=1
kubectl -n <namespace> rollout status statefulset/<release>-mongodb
6. Verify and bring the application back up
Run validate on every collection to confirm that the data is consistent. The
command below pipes the output through grep -v "valid=true" so that only
problematic collections are listed:
kubectl -n <namespace> exec <release>-mongodb-0 -- mongosh --quiet --eval '
  db.getMongo().getDBNames().forEach(function(n) {
    if (["admin","local","config"].indexOf(n) >= 0) return;
    var d = db.getSiblingDB(n);
    d.getCollectionNames().forEach(function(c) {
      var r = d.runCommand({ validate: c });
      print(n + "." + c + " valid=" + r.valid);
    });
  });' | grep -v "valid=true"
If the command produces no output, all collections validated successfully
and the Smartfacts services can be scaled back up:
kubectl -n <namespace> scale deploy --all --replicas=1
If you noted any deployments with more than one replica in step 1, scale
those individually back to their original count, for example:
kubectl -n <namespace> scale deploy <release>-sfit-platform --replicas=2
Verify that all Pods come up healthy:
kubectl -n <namespace> get pods
If any line appeared in the validation output, do not scale the services
back up; investigate the listed collections first.
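If you want to script the go/no-go decision, the filter can be wrapped in a small function that returns a non-zero exit status when any collection is not reported as valid. This is a sketch under the assumption that the validation output consists of lines in the form <db>.<collection> valid=<value>, as printed by the mongosh snippet in this step; the function name is hypothetical.

```shell
# Sketch: read validation lines on stdin, print every line that does not end
# in "valid=true", and return 1 if there is at least one such line.
report_invalid() {
  # grep exits non-zero when nothing is left after filtering; "|| true"
  # keeps that from being treated as an error.
  bad=$(grep -v -e 'valid=true$' -e '^[[:space:]]*$' || true)
  if [ -n "$bad" ]; then
    printf '%s\n' "$bad"
    return 1
  fi
  return 0
}
```

Pipe the output of the mongosh command into report_invalid in place of the grep filter, and only run the scale-up commands when it returns 0.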
Rollback
If the repair fails or causes follow-up issues:
- Scale the MongoDB StatefulSet to zero again.
- Restore the PVC from the backup taken in step 2 (re-create the PVC from
  the snapshot, or restore the tar archive into a fresh PVC of the same
  name).
- Scale the StatefulSet back to one and verify connectivity before
  restarting the Smartfacts services.
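When the backup is a VolumeSnapshot, re-creating the PVC could look like the sketch below, which prints a PVC manifest that uses the snapshot as its data source. It assumes that the damaged PVC has already been deleted, that the default storage class is the one that should provision the volume (no storageClassName is set), and that the requested size is at least the size of the snapshot. The function name is hypothetical; 50Gi is the chart default listed in the reference section.

```shell
# Sketch: print a PVC manifest that restores the MongoDB data volume from a
# VolumeSnapshot.
# $1 = Helm release name
# $2 = name of the VolumeSnapshot to restore from
# $3 = requested size (optional, defaults to the chart default of 50Gi)
restore_pvc_manifest() {
  cat <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: datadir-$1-mongodb-0
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: ${3:-50Gi}
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: $2
EOF
}
```

A possible invocation is restore_pvc_manifest <release> <snapshot-name> | kubectl -n <namespace> apply -f -. Because the PVC keeps the name that the StatefulSet expects, the Pod should bind to the restored volume when the StatefulSet is scaled back to one.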
Reference: chart defaults
The Smartfacts chart pulls in the Bitnami mongodb chart (version 12.1.31)
as a subchart. With the Smartfacts chart defaults this means:
- architecture: standalone
- useStatefulSet: true (so the Pod is owned by a StatefulSet, not a
  Deployment, and the PVC is created via volumeClaimTemplates)
- auth.enabled: false
- persistence.size: 50Gi
- Custom MID image repo.mid.de/mongodb:5.0-MID1.1.0
- extraEnvVars sets GLIBC_TUNABLES=glibc.pthread.rseq=0 (required for the
  UBI-based image)
The manifests in this guide use the following paths and names. If your
deployment overrides any of these in its values, adjust the manifests
accordingly:
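| Item | Value |
|---|---|
| PVC name | datadir-<release>-mongodb-0 |
| Mount path | /bitnami/mongodb |
| Data directory (--dbpath) | /bitnami/mongodb/data/db |
| Container | |
| Pod | <release>-mongodb-0 |
| Image | repo.mid.de/mongodb:5.0-MID1.1.0 |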