Problem Statement
How do you detect and correct shard skew in production?
Explanation
Track per-shard QPS, p95 latency, storage, and hot-key frequency. Alert on deviations from the median and on sustained hotspot ratios. Sample keys of top contributors to locate bad partition choices.
To fix, add virtual nodes, change the hash function with a remap table, or split hot keys using salting. Run an online rebalancer that backfills and then flips routing gradually.
Code Solution
SolutionRead Only
metrics: shard_qps, shard_p95_ms, top_keys{key,qps}