Designing for the Cliff: Calibrating Load to Reach maxmemory in a 1-Hour Run

This project has a hard contract: every run is exactly one hour. That constraint is what keeps results comparable across configurations and makes the exported visualisations consistent.

After I fixed shutdown reliability, the next issue wasn't infrastructure. It was methodology.

The discovery: the "happy path" trap

Reviewing telemetry from a standard run, I noticed that my default load generation (a single ECS task) was not reliably pushing Redis into the state I actually care about.

Memory usage was climbing, but in many cases the run could finish without reaching maxmemory. That means the test is still useful as a smoke check, but it's not a strong performance validation: it mostly measures a cache that's not under pressure.
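
The fix starts with arithmetic, not tooling. Here's a minimal sketch of the calibration I mean: given a memory budget and a per-key footprint, how many unique keys and what write rate are needed to cross the cliff well before the hour is up. Every number below is an illustrative assumption, not a measured value from the lab.

```python
# Calibration arithmetic: how hard must the load generator push to cross
# maxmemory before the hour ends? All values are illustrative assumptions.
RUN_SECONDS = 3600             # the hard one-hour contract
TARGET_FILL_SECONDS = 2700     # assumed: hit the cliff by minute 45
MAXMEMORY_BYTES = 6 * 1024**3  # assumed node memory budget (6 GiB)
VALUE_BYTES = 1024             # assumed payload per SET
KEY_OVERHEAD_BYTES = 90        # rough Redis per-key overhead (approximate)

per_key_bytes = VALUE_BYTES + KEY_OVERHEAD_BYTES
keys_needed = MAXMEMORY_BYTES // per_key_bytes
required_write_rate = keys_needed / TARGET_FILL_SECONDS

print(f"unique keys to fill the budget: ~{keys_needed:,}")
print(f"required write rate: ~{required_write_rate:,.0f} SETs/sec")
```

If a single ECS task can't sustain that rate, the calibration tells me exactly which knob to turn: more tasks, bigger payloads, or a smaller node.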

Shutdown Didn't Happen: Placeholder Semantics Bug

The AWS ElastiCache Lab project has a hard rule: a test run is defined as one hour. That only stays true if the lab reliably shuts down on schedule. If it doesn't, I lose cost control and, more importantly for benchmarking, I risk starting the next run from a non-clean baseline.

I hit exactly that problem on an evening run.

What I observed

The run finished, but the environment was still up. Nothing looked "broken" in the usual sense: services were alive and responsive. In this lab, though, "still works" past the run boundary is a defect, because it means the lifecycle automation failed silently.
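
Because the failure is silent, I want detection as well as prevention. A minimal post-run check along these lines, sketched with boto3 and assuming lab resources carry a project tag (the tag key and value here are hypothetical placeholders), would turn a quiet lifecycle failure into a loud one:

```python
import boto3

# Post-run teardown check: a few minutes after the one-hour boundary,
# nothing tagged for this lab should still exist. The tag key/value
# below are hypothetical; substitute whatever the lab actually uses.
LAB_TAG_KEY = "project"
LAB_TAG_VALUE = "elasticache-lab"

tagging = boto3.client("resourcegroupstaggingapi")
paginator = tagging.get_paginator("get_resources")

leftovers = []
for page in paginator.paginate(
    TagFilters=[{"Key": LAB_TAG_KEY, "Values": [LAB_TAG_VALUE]}]
):
    leftovers.extend(r["ResourceARN"] for r in page["ResourceTagMappingList"])

if leftovers:
    # The silent failure made loud: resources outlived the run boundary.
    raise SystemExit("teardown failed; still present:\n" + "\n".join(leftovers))
print("clean baseline: no lab resources remain")
```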

Beyond Documentation: Building a Data-Driven Test Lab for ElastiCache

"Docs Confidence" is that warm, fuzzy feeling you get after reading AWS whitepapers, right before your cache hits 99% memory, your p99 latency grows a tail, and your assumptions start to melt. It's not incompetence; it's the gap between documented behavior and observed behavior under your specific workload.

I built this repeatable ElastiCache benchmarking platform to close that gap with receipts: timestamped telemetry and exportable artifacts that stand up in a design review. This lab isn't just about Redis (or Valkey) as software; it's about the architectural decisions that land in production, and which engine, which instance class, and which topology deliver the best outcome for the budget.
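
As a taste of what "receipts" means in practice, here's a sketch of pulling one run's worth of timestamped memory telemetry with boto3. BytesUsedForCache and the CacheClusterId dimension are real AWS/ElastiCache CloudWatch metrics; the cluster id is a hypothetical placeholder.

```python
from datetime import datetime, timedelta, timezone

import boto3

# Export one run's worth of timestamped memory telemetry for a node.
# The cluster id "lab-redis-001" is a placeholder, not the lab's real name.
cw = boto3.client("cloudwatch")
end = datetime.now(timezone.utc)
start = end - timedelta(hours=1)  # the fixed one-hour run window

resp = cw.get_metric_data(
    StartTime=start,
    EndTime=end,
    ScanBy="TimestampAscending",  # chronological, like the run itself
    MetricDataQueries=[{
        "Id": "mem",
        "MetricStat": {
            "Metric": {
                "Namespace": "AWS/ElastiCache",
                "MetricName": "BytesUsedForCache",
                "Dimensions": [
                    {"Name": "CacheClusterId", "Value": "lab-redis-001"},
                ],
            },
            "Period": 60,  # one sample per minute
            "Stat": "Maximum",
        },
    }],
)

series = resp["MetricDataResults"][0]
for ts, value in zip(series["Timestamps"], series["Values"]):
    print(ts.isoformat(), int(value))  # timestamped, export-ready rows
```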
