Problem Statement
Describe your incident-response process when a Kubernetes production application becomes unavailable.
Explanation
In a real incident, you first detect the problem through monitoring and alerts, then determine scope and impact: which services, pods, or nodes are affected. Next, you gather logs and metrics to isolate the root cause (for example a bad deployment, a node failure, or a network partition) and apply the matching fix, such as rolling back the deployment, scaling up resources, or replacing the node. Once service is restored, you review the incident in a post-mortem that records what happened, why it happened, how you responded, and what to improve. The post-mortem is what makes cluster reliability improve over time, instead of the team only reacting to each issue as it appears.
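
The scope-and-impact step can be sketched in a few lines of Python. This is a minimal illustration, not a definitive triage tool: it assumes the pods are plain dictionaries shaped like Kubernetes Pod objects (metadata.name, spec.nodeName, status.phase, status.containerStatuses). The function name summarize_scope and the HINTS table are hypothetical, introduced here only for the example.

```python
from collections import Counter

# Hypothetical lookup: container waiting reason -> first thing to check.
# These hints are illustrative assumptions, not an official mapping.
HINTS = {
    "CrashLoopBackOff": "check container logs and the most recent rollout",
    "ImagePullBackOff": "check image tag and registry credentials",
    "ErrImagePull": "check image tag and registry credentials",
}


def summarize_scope(pods):
    """Summarize which pods and nodes are unhealthy, and the reported reasons."""
    unhealthy = []
    reasons = Counter()
    nodes = Counter()
    for pod in pods:
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        containers = status.get("containerStatuses", [])
        all_ready = bool(containers) and all(c.get("ready", False) for c in containers)
        # Running-and-ready pods are healthy; Succeeded pods are finished jobs.
        if (phase == "Running" and all_ready) or phase == "Succeeded":
            continue
        unhealthy.append(pod["metadata"]["name"])
        # Pods that were never scheduled have no nodeName.
        nodes[pod.get("spec", {}).get("nodeName") or "<unscheduled>"] += 1
        waiting = [c.get("state", {}).get("waiting", {}).get("reason") for c in containers]
        waiting = [r for r in waiting if r]
        # Fall back to the pod phase when no container reports a waiting reason.
        for reason in waiting or [phase]:
            reasons[reason] += 1
    return {
        "unhealthy_pods": unhealthy,
        "reasons": dict(reasons),
        "nodes": dict(nodes),
        "hints": sorted({HINTS[r] for r in reasons if r in HINTS}),
    }
```

In practice the input could be the items list from the JSON output of kubectl get pods -n <namespace> -o json. If most unhealthy pods share one node, that points toward a node failure. If they span nodes but share one waiting reason, a bad deployment is the more likely cause.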
