fitness functions – Rafael Bernard Araujo

Chapter 2 introduces the concept of architectural fitness functions, the mechanism that makes "evolutionary" more than a buzzword.

The origin: borrowing from evolutionary computing

The term comes from genetic algorithm design. In evolutionary computing, a fitness function defines what "better" means so that solutions can gradually emerge through small changes across generations. The classic example: when using a genetic algorithm to optimise wing design, the fitness function assesses wind resistance, weight, air flow, and other desirable characteristics. At each generation, the engineer asks: is this closer to or further away from the goal?
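
To make the origin concrete, here is a minimal sketch of a fitness function steering a search. It is a simple mutate-and-select loop, not a full genetic algorithm, and the wing parameters, target values, and scoring formula are all invented for illustration; a real wing optimiser would score designs with an aerodynamic model.

```python
import random

# Hypothetical design: (span_m, thickness_m). The targets and the scoring
# formula are made up; they only need to say which design is "better".
TARGET_SPAN = 30.0
TARGET_THICKNESS = 0.5


def fitness(design: tuple[float, float]) -> float:
    """Higher is better: negative weighted squared distance from the assumed ideal."""
    span, thickness = design
    return -((span - TARGET_SPAN) ** 2 + 100 * (thickness - TARGET_THICKNESS) ** 2)


def evolve(start: tuple[float, float], generations: int, seed: int = 0) -> tuple[float, float]:
    """Mutate the current best design and keep the mutant only if it scores higher."""
    rng = random.Random(seed)  # seeded so runs are repeatable
    best = start
    for _ in range(generations):
        candidate = (best[0] + rng.uniform(-1, 1), best[1] + rng.uniform(-0.05, 0.05))
        if fitness(candidate) > fitness(best):  # "closer to the goal?"
            best = candidate
    return best


start = (20.0, 1.0)
result = evolve(start, generations=500)
```

The loop never needs to know what a good wing looks like; it only needs the fitness function to tell it whether a change moved closer to or further from the goal.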

Ford, Parsons and Kua borrow this concept for software:

An architectural fitness function provides an objective integrity assessment of some architectural characteristic(s).

In software, fitness functions check that developers preserve important architectural characteristics, the "-ilities" architects care about: scalability, security, performance, maintainability, resilience.

The core idea

An evolutionary architecture supports guided, incremental change across multiple dimensions. The key word is guided. Without guidance, incremental change is just drift. Fitness functions are what provide the guidance.

The fitness function protects the various architectural characteristics required for the system. These requirements differ greatly across systems and organisations: some require intense security; others require significant throughput or low latency; others need resilience to failure. A crucial early architecture decision is to define which dimensions matter most for a given system, based on business drivers, technical capabilities, and scale.

Why this matters

Most teams have implicit architectural goals: "the system should be fast", "services should be loosely coupled", "we should be secure". The problem is that implicit goals erode. Nobody notices the slow degradation until a characteristic has already failed.

Fitness functions make the implicit explicit. They turn architectural aspirations into verifiable checks, automated where possible and manual where necessary.

A key insight: improving one architectural dimension can accidentally harm another. Improving performance with caching might harm data freshness or security. Fitness functions act as guardrails that detect these tradeoff violations before they reach production.
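
The caching tradeoff can be sketched as a single check that looks at both dimensions at once. This is only an illustration: the `Measurement` fields and both limits are assumptions, not values from the book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measurement:
    p99_latency_ms: float   # performance dimension
    max_staleness_s: float  # data-freshness dimension


def tradeoff_violations(m: Measurement,
                        max_latency_ms: float = 200.0,
                        max_staleness_s: float = 60.0) -> list[str]:
    """Return the dimensions that are out of bounds (limits are assumed values)."""
    violations = []
    if m.p99_latency_ms > max_latency_ms:
        violations.append("latency")
    if m.max_staleness_s > max_staleness_s:
        violations.append("freshness")
    return violations


no_cache = Measurement(p99_latency_ms=350.0, max_staleness_s=0.0)
aggressive_cache = Measurement(p99_latency_ms=40.0, max_staleness_s=900.0)
tuned_cache = Measurement(p99_latency_ms=120.0, max_staleness_s=30.0)
```

With these made-up numbers, the uncached system fails on latency, the aggressively cached one fixes latency but fails on freshness, and only the tuned configuration passes both.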

Categorising fitness functions

The book defines several dimensions for classifying fitness functions:

Atomic vs Holistic

Atomic — tests one particular aspect of the architecture in isolation. Example: a unit test checking for cyclic dependencies in a package, or a code metric that checks cyclomatic complexity.
Holistic — tests a combination of architectural aspects, assessing interactions between different concerns. Example: testing the number of concurrent users within a certain latency range while caching is enabled — this simultaneously checks scalability and data freshness. Holistic functions are harder to build but capture what atomic ones miss.
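
As a rough sketch of the atomic example above, a cyclic-dependency check can be a depth-first search over a module graph. The graph here is hand-written and the module names are hypothetical; in a real project it would be extracted from the codebase by a tool.

```python
from typing import Optional


def find_cycle(deps: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one dependency cycle as a list of module names, or None if acyclic."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    state: dict[str, int] = {}
    path: list[str] = []

    def visit(node: str) -> Optional[list[str]]:
        state[node] = GREY
        path.append(node)
        for dep in deps.get(node, []):
            if state.get(dep, WHITE) == GREY:
                # dep is already on the current path, so we closed a loop
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for node in deps:
        if state.get(node, WHITE) == WHITE:
            cycle = visit(node)
            if cycle:
                return cycle
    return None


layered = {"web": ["domain"], "domain": ["shared"], "shared": []}
tangled = {"orders": ["billing"], "billing": ["customers"], "customers": ["orders"]}
```

Wrapped in a unit test that asserts `find_cycle(...) is None`, this checks one characteristic in isolation, which is what makes it atomic.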

Triggered vs Continuous vs Temporal

Triggered — executed in response to a specific event: a developer running a unit test, a CI pipeline stage, a QA person performing exploratory testing.
Continuous — constant verification of architectural aspects. Monitoring and alerting are the classic examples. Netflix's Chaos Monkey — which runs in production and randomly terminates instances — is a continuous holistic fitness function that forces teams to build resilient services.
Temporal — have a particular time component. Example: a reminder to check whether important security updates have been performed, or a scheduled dependency check that alerts on outdated libraries.
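
A temporal function can be as small as a date comparison run on a schedule. The sketch below assumes a hand-maintained record of when each recurring check was last completed; the check names, dates, and 30-day interval are all hypothetical.

```python
from datetime import date, timedelta

# Hypothetical record of when each recurring check was last completed.
LAST_DONE = {
    "security patches applied": date(2024, 1, 10),
    "dependency audit": date(2024, 3, 1),
}
MAX_AGE = timedelta(days=30)  # assumed review interval


def overdue_checks(last_done: dict[str, date], today: date, max_age: timedelta) -> list[str]:
    """Return the names of checks whose last completion is older than max_age."""
    return sorted(name for name, done in last_done.items() if today - done > max_age)


reminders = overdue_checks(LAST_DONE, today=date(2024, 3, 15), max_age=MAX_AGE)
```

Passing `today` in explicitly, rather than calling `date.today()` inside the function, keeps the check itself testable.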

Static vs Dynamic

Static — fixed predefined acceptable values. Binary pass/fail (a unit test), or a threshold (latency must be < 200ms).
Dynamic — acceptable values depend on context. Acceptable latency might depend on actual system scale; security requirements might vary based on the regulatory environment.
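
The difference between the two can be sketched with a latency limit. The static version uses the 200 ms threshold from the example above; the dynamic policy (10 ms of extra budget per additional 10,000 users, capped at 300 ms) is purely an assumption for illustration.

```python
def static_latency_ok(p99_ms: float) -> bool:
    """Static: one fixed, predefined acceptable value."""
    return p99_ms < 200.0


def dynamic_latency_limit(active_users: int) -> float:
    """Dynamic: the acceptable value depends on current scale (hypothetical policy)."""
    extra_blocks = max(0, active_users - 10_000) // 10_000
    return min(300.0, 200.0 + 10.0 * extra_blocks)


def dynamic_latency_ok(p99_ms: float, active_users: int) -> bool:
    return p99_ms < dynamic_latency_limit(active_users)
```

Under this policy a p99 of 215 ms fails the static check but passes the dynamic one at 35,000 active users, where the limit rises to 220 ms.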

Automated vs Manual

Automated — unit tests, deployment pipeline checks, stress tests, chaos engineering. Ideally as much automation as possible.
Manual — some things can't be automated (legal approval requirements, certain QA processes). Some things aren't automated yet. The goal is to push the boundary toward automation over time.

What fitness functions look like in practice

Fitness functions encompass existing engineering practices but also extend beyond them:

Category	Examples	Type
Architecture tests	phpat (PHP/PHPStan) or ts-arch (TypeScript) rules checking component dependencies, layer violations, naming conventions, import directionality	Atomic, triggered
Code metrics	Cyclomatic complexity thresholds, afferent/efferent coupling limits	Atomic, triggered
Contract tests	API contract verification ensuring requirements are met	Atomic, triggered
Security scanning	Vulnerability scanning, licence compliance checks for open-source dependencies	Atomic, triggered
Performance testing	Load tests validating latency SLOs under expected concurrency	Holistic, triggered
Monitoring & alerting	p99 latency monitors, error rate thresholds, SLO compliance dashboards	Atomic/holistic, continuous
Chaos engineering	Netflix Simian Army — randomly terminating instances, availability zones, or entire regions	Holistic, continuous
Security reviews	Quarterly security audits, penetration testing	Holistic, manual/temporal
Dependency freshness	Scheduled checks for outdated libraries or security patches	Atomic, temporal

The best fitness functions are automated and triggered: they give feedback at the point of change, not weeks later. Place them in the deployment pipeline, with fast atomic functions early and slow holistic functions later.
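
To show what the "architecture tests" row in the table amounts to, here is a hand-rolled stand-in written with Python's standard `ast` module. It is not how phpat or ts-arch work internally, and the three layer names and the `ALLOWED` rule table are hypothetical.

```python
import ast

# Hypothetical layering rule: each layer may import only from the layers listed.
ALLOWED = {
    "domain": set(),
    "application": {"domain"},
    "infrastructure": {"domain", "application"},
}


def layer_violations(layer: str, source: str) -> list[str]:
    """Return imports in `source` that reach into a layer `layer` may not depend on."""
    violations = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            names = [node.module]
        else:
            continue
        for name in names:
            top = name.split(".")[0]  # top-level package identifies the layer
            if top in ALLOWED and top != layer and top not in ALLOWED[layer]:
                violations.append(name)
    return violations


domain_file = "from infrastructure.db import Session\nimport json\n"
application_file = "from domain.order import Order\n"
```

Run as a unit test over every file in a layer, a rule like this fails the build the moment someone imports in the wrong direction.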

Deployment pipelines as the enforcement mechanism

Fitness functions only work if they're part of the delivery workflow. The deployment pipeline is where they live:

Early stages — fast, atomic checks: architecture tests (phpat, ts-arch), code metrics, linting, security scanning, contract tests.
Middle stages — integration and performance tests, holistic triggered functions.
Later stages / production — continuous monitoring, chaos engineering, temporal reminders.
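
The staging idea can be sketched as a tiny runner that executes checks stage by stage and stops at the first stage that fails, so slow checks never run against a change that already broke a fast one. The `FitnessFunction` type, the stage numbers, and the example checks are assumptions; a real pipeline would express this in its CI configuration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class FitnessFunction:
    name: str
    stage: int                 # 1 = early, 2 = middle, 3 = later
    check: Callable[[], bool]  # returns True when the characteristic holds


def run_pipeline(functions: list[FitnessFunction]) -> tuple[bool, list[str]]:
    """Run checks in stage order; stop after the first stage containing a failure."""
    executed: list[str] = []
    for stage in sorted({f.stage for f in functions}):
        stage_ok = True
        for f in (f for f in functions if f.stage == stage):
            executed.append(f.name)
            if not f.check():
                stage_ok = False
        if not stage_ok:
            return False, executed
    return True, executed


functions = [
    FitnessFunction("load test", stage=2, check=lambda: True),
    FitnessFunction("architecture rules", stage=1, check=lambda: False),
    FitnessFunction("complexity threshold", stage=1, check=lambda: True),
]
passed, executed = run_pipeline(functions)
```

Here the failing architecture rule in stage 1 stops the run before the load test starts, although every check within the failing stage still reports its result.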

As Thoughtworks puts it: "creating the desired fitness functions — and including them in appropriate delivery pipelines — communicates these metrics as an important aspect of enterprise architecture."

The four layers of fitness (from NILUS)

A useful framing from practice splits fitness functions across four layers:

Structural fitness — code dependencies, database access patterns, API contracts, service boundaries.
Behavioural fitness — latency, resilience, throughput, consistency, recovery behaviour.
Operational fitness — deployment independence, observability coverage, runbook readiness, SLO compliance.
Semantic fitness — bounded context integrity, event naming quality, policy ownership, domain model consistency.

Most teams start at structural (the easiest to automate) and never reach semantic. But semantic fitness functions (checking that your domain model remains coherent as it evolves) are often the most valuable for long-lived systems.
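
Semantic fitness is hard to automate, but parts of it have mechanical proxies. As one hedged example, event naming can be checked against a convention. The convention here (`<Context>.<PascalCaseName>`) and the list of bounded contexts are hypothetical, and a naming check only approximates whether the domain model is actually coherent.

```python
import re

# Hypothetical bounded contexts and naming convention: "<Context>.<PascalCaseName>".
KNOWN_CONTEXTS = {"Billing", "Shipping"}
EVENT_NAME = re.compile(r"([A-Z][A-Za-z]+)\.([A-Z][A-Za-z]+)")


def naming_violations(event_names: list[str]) -> list[str]:
    """Return event names that break the convention or use an unknown context."""
    bad = []
    for name in event_names:
        match = EVENT_NAME.fullmatch(name)
        if match is None or match.group(1) not in KNOWN_CONTEXTS:
            bad.append(name)
    return bad


events = ["Billing.InvoiceIssued", "shipping.parcel_sent", "Crm.LeadCreated"]
violations = naming_violations(events)
```

A check like this catches drift early: an event published under a context nobody has agreed to own shows up as a failing test rather than as a surprise in production.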

Systems thinking

Dr. Russell Ackoff's quote captures the deeper point:

A system is never the sum of its parts. It is the product of the interaction of its parts.

Fitness functions that only measure individual components miss the point. The interesting failures happen at integration boundaries — between services, between teams, between intentions and reality. Holistic fitness functions (end-to-end latency, deployment frequency, change failure rate) capture what atomic ones cannot.

How I'm applying this

This connects directly to work I care about:

The platform modernisations I've designed and implemented were operational fitness work: bringing reliability through automated deployment pipelines, observability and monitoring, and runbook readiness. I just called it "keeping things running."
ADRs capture the decisions. Fitness functions verify those decisions are still holding. Decisions and verification go hand in hand.
Kent Beck's Test Desiderata is itself a fitness function for test quality — a checklist of characteristics that tests should exhibit (isolated, deterministic, fast, behavioural, structure-insensitive, specific, predictive).
DORA metrics (deployment frequency, lead time, change failure rate, MTTR) are fitness functions for delivery capability.
Code health metrics (as described in the Loveholidays case from Tropeçando 120) are fitness functions that enabled Loveholidays' AI-first shift: the team invested in code health metrics before adopting AI, which is exactly the fitness-function-first approach.
phpat (PHP, as a PHPStan extension) and ts-arch (TypeScript) let you write architecture rules as unit tests that run in CI, which is the purest implementation of triggered atomic fitness functions.
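
Of the items above, the DORA metrics are the most directly computable. The sketch below derives three of them from a list of deployment records; the `Deployment` shape and the sample data are assumptions, and real numbers would come from the deployment and incident tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass(frozen=True)
class Deployment:
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set only for failed deployments


def deployment_frequency(deployments: list[Deployment], window_days: int) -> float:
    """Deployments per day over the observation window."""
    return len(deployments) / window_days


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Share of deployments that caused a failure (assumes a non-empty list)."""
    return sum(d.failed for d in deployments) / len(deployments)


def mean_time_to_restore(deployments: list[Deployment]) -> timedelta:
    """Average time from a failed deployment to restoration (assumes at least one)."""
    durations = [d.restored_at - d.deployed_at
                 for d in deployments if d.failed and d.restored_at]
    return sum(durations, timedelta()) / len(durations)


# Made-up sample week: four deployments, one failure restored after 30 minutes.
week = [
    Deployment(datetime(2024, 3, 4, 10, 0)),
    Deployment(datetime(2024, 3, 5, 10, 0), failed=True,
               restored_at=datetime(2024, 3, 5, 10, 30)),
    Deployment(datetime(2024, 3, 6, 10, 0)),
    Deployment(datetime(2024, 3, 7, 10, 0)),
]
```

Turning any of these into a fitness function is then a matter of comparing the result against a threshold the team has agreed on and alerting when it is crossed.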

The pattern: define what matters, measure it, enforce it automatically, and revisit periodically. Architecture that can't be verified can't evolve — it can only decay.