Shutdown Didn't Happen: Placeholder Semantics Bug

AWS ElastiCache Lab project has a hard rule: a test run is defined as one hour. That only stays true if the lab reliably shuts down on schedule. If it doesn't, I lose cost control and-more importantly for benchmarking-I risk starting the next run from a non-clean baseline.

I hit exactly that problem on an evening run.

What I observed

The run finished, but the environment was still up. Nothing looked "broken" in the usual sense: services were alive and responsive. In this lab, though, "still works" past the run boundary is a defect, because it means the lifecycle automation failed silently.

Root cause (shutdown semantics)

The trigger was a recent change: I introduced a shutdown placeholder in Terraform to wire shutdown configuration through the stack.

My Python shutdown function had logic that effectively did: "if the shutdown value exists, it's already set." The placeholder satisfied that existence check, so the function treated shutdown as already configured and exited without executing the shutdown actions.

My intended behavior was the opposite: the function should overwrite the placeholder with the real runtime value and continue.

So the bug wasn't complexity-it was semantics. I checked presence, but I needed to check validity.

What I changed

I tightened the shutdown path so it can't make that mistake again:

A placeholder is treated as unset, not "configured".
Overwrite behavior is explicit, not implied by defaults.
Shutdown is considered "done" only when the environment is observably stopped (not just because the function ran).

That keeps the one-hour contract meaningful and preserves comparability by ensuring each run ends cleanly.

Tags:

AWS, cache, redis, Amazon ElastiCache, lab, github, terraform, load testing