Solving Problems

Solving problems 1: ECS, Event Bridge Scheduler, PHP, migrations

I love Mondays and Business as Usual. Solving problems is a delightful day-to-day task. Maybe this is what working with software means in the end. Do not take me wrong, it opens the doors for greenfield projects and experimentation. While mastering the business I can experiment, change and rebuild.

The solving problems series is just a way to share small ideas, experiences and outcomes of solving daily problems as I go. I wonder if some tips or experiences shared can help you build better what you are working on right now.

During the last months, I have been migrating an important PHP service to ECS Fargate along with the runtime upgrade. The service is composed of a lot of parts and we have been architecting the migration so the operation causes no downtime to customers, even when they are over four different continents and many time zones.

One very important part of the service is already running in production for some months with success. We are preparing the next service.

For the migration plan, we deployed infrastructure ahead of starting moving traffic, planned to daily incremental traffic switch, like 5, 10, 25, 50, 75, and close monitoring. Also prepared a second plan to avoid rollback in case some performance issue arises. While monitoring we created backlog tickets with the observability outcomes.

During migration phases prepare yourself beforehand for the initial (1%, or 5%) traffic switch, so you can catch quickly those hidden use cases that only happen in production and act quickly. If you do so, other phases are just a matter of watching how scaling works.

Using containers (of course Kubernetes is a great alternative) is a fantastic opportunity to upgrade PHP runtimes efficiently at the same time where we use a much better platform that helps with delivery and developer experiences. The very first and most important step I recommend is to review how you deal with your secret and environment variables. This is pivotal for the success of a smooth migration.

We can expect that those type of applications has a fair amount of cron jobs associated with them. This is a great opportunity to follow the old saying "use the right tool for the right problem" and my suggestion would be to rewrite it, turning it into Lambda or Step Functions, as applicable to each of what the cron job is doing. This is closer to what and how a job should run.

It happens that not always we can start refactoring right away, and then I can say that my experiences with Event Bridge Scheduler triggering ECS tasks (previously cron jobs) are great. They are interestingly cheap alternatives while waiting for the refactoring project to take over. Don't take this as your permanent solution though, because it is not just right and a waste of resources and couple the cron job too much with parts of the application not really related.

We were reviewing the backlog and observability results of the last service. As we could prioritise and execute some backlog tickets, the dashboard and metrics highlighted that we had some room to review scaling and resource thresholds. We changed them carefully, resulting in a bill ~50% cheaper, CPU and memory resource stable and no performance degradation.

Some notes:

  • Investing in test automation is good for your developer experience, site reliability and revenue; also a great support for technology improvements
  • It is worth taking a look at the ALBRequestCountPerTarget metric if you have CPU-heavy processes as you can better control how ECS will handle scale policies, avoiding peak of CPU where the CPU average metric is not enough for scaling