Categorias
Programação Solving Problems

Setting up maintenaince mode with Varnish

Varnish is "the free, open-source software that enables super fast delivery of HTTP or API based content", "an HTTP reverse proxy that works by caching frequently requested web pages, so they can be loaded quickly without having to wait for a server response.".

If you need some sort of an alternative cloud or servers in datacenter Varnish can act as CDN, Load Balancer and Api Gateway layers at the same time. It is very powerful when you have to manage those services instead of using a Cloud service. And this is not that uncommon.

Varnish

Consider the use case where you need a maintenance window for a product for which you need to be sure that you are suspending all connections to the backend servers in a consistent way. Performing a redirection in the CDN layer is the better choice.

The Varnish Configuration Language (VCL) is a domain-specific programming language used by Varnish to control request handling, routing, caching, and several other aspects.

-- https://www.varnish-software.com/developers/tutorials/varnish-configuration-language-vcl/

This is a very powerful language, that, for our use case, will allow creating a synthetic response to proxy the request to, instead of hitting the backends. There will be no need to create a web directory to be served for another web server, but just directly from Varnish.

# default.vcl
sub vcl_synth 
{
    # previous headers manipulation if you like
    # and other code that you need for synth if you like
    # (...)

    # Adding an x-cache header to indicate this is a synth response
    set resp.http.x-cache = "synth synth";

    # Maintenance - we are calling the status for this synth 911 because we can have different synths
    if ( resp.status == 911 ) {
        set resp.http.Content-Type = "text/html; charset=utf-8";
        # You can put absolutely what you want
        synthetic ({"
<html>
<head>
    <title>Maintenance mode - Try again later</title>
</head>
<body>
<h1>This website is under maintenance.</h1>
</body>
</html>
"});
        return (deliver);
    }
}

Then I can forward everything that requests my-domain.com to the maintenance

# includes/my-domain-hints.vcl

if ( req.http.host ~ "my-domain.com" ) {
    return(synth(911, ""));
    # All the other VCL configs are below here, but we are returning early above to the maintenance
    # (...)
}
Categorias
Solving Problems

Solving problems week 2: Cypress test automation, E2E, DevExp, code standards, rector

In this week's Solving Problems text, I have two topics: code quality and service reliability. I will share some ideas that can improve your set of tools that check if your codebase is healthy.

The problem with the Code Quality topic is how to keep your standards and some trivial issues out of the code without sacrificing your reviewer's time and patience.

And the problem with Service Reliability is how to be active in monitoring the most essential parts of your service without relying on human tests.

Automated tests

There are some red lights that you might be missing an important part of your Software Reliability: the QA team being a bottleneck due to human-resource timing constraints, too many PR reverts before deployment and lack of unit tests on each codebase. That usually comes with a very concerning outcome: customer support tickets. That is very bad for the product's reputation and a constant source of stress and bug lists to grow.

There are lots of quality checks that we could talk about, like enforcing unit testing coverage in the pipeline, applying feature flagging, improving the regression QA processing steps, etc. They are all pivotal and needed, but where to start? There is a first step, on similar scenarios, that I recommend prioritised, keeping the other processes running in parallel: automated end-to-end tests.

Testing plays a key role in development. By continuously monitoring application workflows and features, your tests can surface broken functionality before your customers do.

-- Best practices for creating end-to-end tests, DataDog

An Automated end-to-end test (we can also call it Smoke tests) must cover your most important scenarios and test them continuously and after any deployment. After you check those most four/five very important scenarios, you can keep improving the smoke tests based on the most used use cases. We have currently powerful tools to write them. Let's say cypress

describe('Verify dashboard', () => {
    const baseUrl = ${Cypress.env('baseUrl')};
    const env = Cypress.env('env');
    const dataFile: any = credentials_${env}.json;

    it('Verify raw Admin User profile', () => {
        cy.loginApplication();
        cy.fixture(dataFile).then(testData => {
            const profileUrl = testData['adminUser'].profileUrl;
            cy.visit(baseUrl + profileUrl);
        });
        cy.contains('Profile');
        cy.visit(${baseUrl}/logout);
    });
});

Developer experience

A quality code check that a mature codebase has is to lint and check code standards. As authors, we are used to waiting for CI to perform steps and be sure that the same successful result status we see when running the steps locally is also successful in CI. As reviewers, we are used to comments asking to check CI or asking to use the agreed pattern when the codebase doesn't have a good quality check step in place.

Both are part of the passive code review that steals our time from the essential problems we need to review in the code: architectural or business logic problems that are often missed by tired eyes. We can do better and use CI and the step tool in our favour.

This can be reproduced in any language, but focusing on PHP, we can use tools like Rector to check and fix those easy-to-spot problems. You can just set a step in our pipeline that will fix the errors and commit it again.

Some can say that it could be a pre-hook step, but this is usually skipped when takes more than 2s. I agree that this is easy to just run the fixes on the diff in our local, but we just have to do better if we automate the changes in case some developers just do not run it before pushing the commits or whatever reason.

This would be a very useful automation, that will run for every open Pull Request, committing code standards or lint issues. The pipeline will contribute a lot to your codebase with very little maintenance. Mind that this will execute Rector (therefore moving the code to the state your team agreed and not just code style) and Easy Code Standards (combine power of PHP_CodeSniffer and PHP CS Fixer in 3 lines)

name: Rector CI

on:
  pull_request: null

jobs:
  rector-ci:
    runs-on: ubuntu-latest
    # run only on commits on main repository, not on forks
    if: github.event.pull_request.head.repo.full_name == github.repository
    env:
      COMPOSER_AUTH: ${{ secrets.COMPOSER_AUTH }}
    steps:
      - uses: actions/checkout@v4
        with:
          # Solves the not "You are not currently on a branch" problem, see https://github.com/actions/checkout/issues/124#issuecomment-586664611
          ref: ${{ github.event.pull_request.head.ref }}
          # Must be used to trigger workflow after push
          token: ${{ secrets.GH_PAT_TOKEN }}

      - uses: shivammathur/setup-php@v2
        with:
          php-version: 8.1
          coverage: none
          extensions: <your-extensions-here>

      -   run: composer install --no-progress --ansi

      ## First run Rector without --dry-run, it would stop the process with exit 1 here
      -   run: vendor/bin/rector process --ansi

      - name: Check for Rector modified files
        id: rector-git-check
        run: |
          export CHANGES=$(if git diff --exit-code --no-patch; then echo "false"; else echo "true"; fi)
          echo "modified=$CHANGES" >> "$GITHUB_OUTPUT"

      - name: Git config
        if: steps.rector-git-check.outputs.modified == 'true'
        run: |
          git config --global user.name 'rector-bot'
          git config --global user.email '[email protected]'
          export LOG=$(git log -1 --pretty=format:"%s")
          echo "COMMIT_MESSAGE=${LOG}" >> "$GITHUB_ENV"

      - name: Commit Rector changes
        if: steps.rector-git-check.outputs.modified == 'true'
        run: git commit -am "[rector] ${COMMIT_MESSAGE}"

      ## Now, there might be coding standard issues after running Rector
      - run: composer run ecs:fix

      - name: Check for CS modified files
        id: cs-git-check
        run: |
          export CHANGES=$(if git diff --exit-code --no-patch; then echo "false"; else echo "true"; fi)
          echo "modified=$CHANGES" >> "$GITHUB_OUTPUT"

      - name: Git config
        if: steps.cs-git-check.outputs.modified == 'true'
        run: |
          git config --global user.name 'rector-bot'
          git config --global user.email '[email protected]'
          export LOG=$(git log -1 --pretty=format:"%s")
          echo "COMMIT_MESSAGE=${LOG}" >> "$GITHUB_ENV"

      - name: Commit CS changes
        if: steps.cs-git-check.outputs.modified == 'true'
        run: git commit -am "[cs] ${COMMIT_MESSAGE}"

      - name: Push changes
        if: steps.cs-git-check.outputs.modified == 'true'
        run: git push

Links:

Categorias
Solving Problems

Solving problems 1: ECS, Event Bridge Scheduler, PHP, migrations

I love Mondays and Business as Usual. Solving problems is a delightful day-to-day task. Maybe this is what working with software means in the end. Do not take me wrong, it opens the doors for greenfield projects and experimentation. While mastering the business I can experiment, change and rebuild.

The solving problems series is just a way to share small ideas, experiences and outcomes of solving daily problems as I go. I wonder if some tips or experiences shared can help you build better what you are working on right now.


During the last months, I have been migrating an important PHP service to ECS Fargate along with the runtime upgrade. The service is composed of a lot of parts and we have been architecting the migration so the operation causes no downtime to customers, even when they are over four different continents and many time zones.

One very important part of the service is already running in production for some months with success. We are preparing the next service.

For the migration plan, we deployed infrastructure ahead of starting moving traffic, planned to daily incremental traffic switch, like 5, 10, 25, 50, 75, and close monitoring. Also prepared a second plan to avoid rollback in case some performance issue arises. While monitoring we created backlog tickets with the observability outcomes.

During migration phases prepare yourself beforehand for the initial (1%, or 5%) traffic switch, so you can catch quickly those hidden use cases that only happen in production and act quickly. If you do so, other phases are just a matter of watching how scaling works.

Using containers (of course Kubernetes is a great alternative) is a fantastic opportunity to upgrade PHP runtimes efficiently at the same time where we use a much better platform that helps with delivery and developer experiences. The very first and most important step I recommend is to review how you deal with your secret and environment variables. This is pivotal for the success of a smooth migration.

We can expect that those type of applications has a fair amount of cron jobs associated with them. This is a great opportunity to follow the old saying "use the right tool for the right problem" and my suggestion would be to rewrite it, turning it into Lambda or Step Functions, as applicable to each of what the cron job is doing. This is closer to what and how a job should run.

It happens that not always we can start refactoring right away, and then I can say that my experiences with Event Bridge Scheduler triggering ECS tasks (previously cron jobs) are great. They are interestingly cheap alternatives while waiting for the refactoring project to take over. Don't take this as your permanent solution though, because it is not just right and a waste of resources and couple the cron job too much with parts of the application not really related.

We were reviewing the backlog and observability results of the last service. As we could prioritise and execute some backlog tickets, the dashboard and metrics highlighted that we had some room to review scaling and resource thresholds. We changed them carefully, resulting in a bill ~50% cheaper, CPU and memory resource stable and no performance degradation.

Some notes:

  • Investing in test automation is good for your developer experience, site reliability and revenue; also a great support for technology improvements
  • It is worth taking a look at the ALBRequestCountPerTarget metric if you have CPU-heavy processes as you can better control how ECS will handle scale policies, avoiding peak of CPU where the CPU average metric is not enough for scaling

Links: