# Architecture Decisions
Architecture Decision Records (ADRs) for every significant design choice in log0 - why Kafka, why ClickHouse, why deterministic fingerprinting, why manual ACK, and more.
## About These Records
Architecture Decision Records (ADRs) document the why behind design choices. Code shows you what was built. ADRs show you why it was built that way - including the alternatives that were considered and rejected.
Each record follows this format: Context (what problem forced a decision), Decision (what was chosen), Alternatives considered (what was rejected), and Consequences (trade-offs accepted).
## ADR-001: Apache Kafka as the Event Bus

**Status:** Accepted
### Context
log0 needs to move log events through a multi-stage processing pipeline: ingestion → normalization → clustering → incident creation → notification. The pipeline stages have different throughput requirements, can fail independently, and need to scale independently. The ingestion gateway must return immediately to the client - it cannot wait for downstream processing to complete.
A message queue or event bus is required to decouple the stages.
### Decision
Use Apache Kafka (Confluent 7.5.0) as the event bus between all services.
### Alternatives Considered
| Alternative | Why Rejected |
|---|---|
| RabbitMQ | AMQP-based push model makes it harder to replay events. No Kafka-style consumer group and partition semantics. Log replay (for debugging or DLQ recovery) requires additional tooling. |
| AWS SQS/SNS | Vendor lock-in. No consumer group semantics - SQS requires one queue per consumer. Fan-out to multiple consumers requires SNS → multiple SQS queues. Replay is not native. |
| Redis Streams | Suitable for low-throughput use cases. At 10,000 logs/second, Redis becomes a bottleneck. No partition-level ordering guarantee keyed by tenantId. |
| Direct HTTP | Synchronous coupling. If the normalization service is down, ingestion fails. Does not support the at-least-once reliability model. |
### Consequences
Accepted trade-offs:
- Kafka requires Zookeeper (or KRaft in newer versions), adding operational complexity
- Local development requires running Kafka infrastructure (Docker Compose)
- Message ordering is guaranteed per partition (per tenantId), not globally
Benefits realized:
- Each service can scale independently based on its consumer lag
- Kafka's consumer group semantics allow multiple normalization instances to share load automatically
- Message replay is native - reprocessing from a given offset requires no extra tooling
- The DLQ pattern is a natural extension of Kafka's existing topic model
- Producer idempotence (`enable.idempotence=true`) prevents duplicate messages under retries
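As a sketch of what the idempotence decision looks like on the producer side, the settings below are the standard Kafka producer properties involved; the bootstrap address is illustrative, and the real services configure this through Spring rather than raw `Properties`:

```java
import java.util.Properties;

public class ProducerConfigSketch {

    // Minimal producer properties for broker-side deduplication under retries.
    // Only enable.idempotence is the subject of this ADR; the rest are the
    // settings idempotence requires or commonly accompanies.
    static Properties producerProps() {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // illustrative
        props.put("enable.idempotence", "true");            // dedupe on the broker under retries
        props.put("acks", "all");                           // required for idempotence
        props.put("retries", Integer.toString(Integer.MAX_VALUE));
        return props;
    }
}
```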
## ADR-002: ClickHouse for Log Event Storage

**Status:** Accepted
### Context
log0 stores every normalized log event for historical queries and AI summary generation. The access patterns are:
- High-volume writes: up to 10,000 events/second
- Aggregation queries: `GROUP BY fingerprint ORDER BY COUNT DESC` over millions of rows
- Time-range scans: "show all ERROR logs for tenant X in the last 24 hours"
- Filtered reads: "show logs matching fingerprint Y for incident Z"
These are analytical access patterns, not transactional ones.
### Decision
Use ClickHouse as the log event store.
### Alternatives Considered
| Alternative | Why Rejected |
|---|---|
| Elasticsearch | Excellent full-text search, but expensive to operate at scale. Storage footprint is 3–5x that of ClickHouse for the same data. Aggregation queries (GROUP BY, COUNT) are significantly slower than on ClickHouse. |
| PostgreSQL | Already used for incident state. Not suitable for log storage - OLTP database, poor compression for time-series data, slow on large aggregations. GROUP BY fingerprint over millions of rows requires full table scans. |
| DynamoDB | No native aggregation support. Running GROUP BY queries requires exporting to Athena. Operational complexity disproportionate to the use case. |
| Loki (Grafana) | Opinionated log label model. Limited query language. Better suited for infrastructure logs than application event analytics. |
### Consequences
Accepted trade-offs:
- ClickHouse has a different query dialect (not standard SQL in all aspects)
- ACID transactions are not supported - not appropriate for transactional writes
- Learning curve for engineers familiar with PostgreSQL but not ClickHouse
Benefits realized:
- Columnar storage means `GROUP BY fingerprint` reads only the fingerprint column - dramatically faster than row-based alternatives
- Native compression reduces storage cost 5–10x compared to Elasticsearch for log data
- `DateTime64` type provides nanosecond precision for accurate log correlation
- `Map(String, String)` type for `attributes` allows schema-free structured data without migrations
## ADR-003: Deterministic SHA-256 Fingerprinting

**Status:** Accepted
### Context
Incident deduplication requires grouping "similar" log events together. Two errors that differ only in dynamic values (user ID, IP address, timeout duration) should be treated as the same error pattern. The question is: how do you define "similar"?
### Decision
Use deterministic SHA-256 fingerprinting based on a normalized message template.
The fingerprint formula:

```
fingerprint = SHA-256(
    serviceName + "|" +
    messageTemplate + "|" +   ← dynamic values stripped
    exceptionType + "|" +     ← nullable
    firstStackFrame           ← line numbers stripped
)
```

Message templating strips dynamic values before hashing:

- Numbers → `<number>`
- IP addresses → `<ip>`
- UUIDs → `<uuid>`
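The scheme above can be sketched in a few lines of plain Java. The field order follows the formula; the regex patterns and method names are illustrative, not the production implementation (note the replacement order: UUIDs and IPs must be templated before bare numbers, or their digits would be consumed first):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HexFormat;

public class FingerprintSketch {

    // Strip dynamic values so structurally identical messages share a template.
    // UUIDs first, then IPs, then bare numbers (order matters).
    static String template(String message) {
        return message
            .replaceAll("\\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-"
                      + "[0-9a-fA-F]{4}-[0-9a-fA-F]{12}\\b", "<uuid>")
            .replaceAll("\\b\\d{1,3}(\\.\\d{1,3}){3}\\b", "<ip>")
            .replaceAll("\\d+", "<number>");
    }

    static String fingerprint(String serviceName, String message,
                              String exceptionType, String firstStackFrame) {
        String input = serviceName + "|"
                + template(message) + "|"
                + (exceptionType == null ? "" : exceptionType) + "|"
                + firstStackFrame.replaceAll(":\\d+", "");   // strip line numbers
        try {
            byte[] hash = MessageDigest.getInstance("SHA-256")
                    .digest(input.getBytes(StandardCharsets.UTF_8));
            return HexFormat.of().formatHex(hash);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e); // SHA-256 is always available
        }
    }
}
```

With this sketch, the two timeout messages from the table below hash identically, because both reduce to the same `timeout after <number>ms` template.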
### Alternatives Considered
| Alternative | Why Rejected |
|---|---|
| ML-based semantic clustering | Requires training data, model maintenance, and introduces non-determinism. The same error may be clustered differently after a model update, breaking existing incidents. No reproducibility guarantee. |
| Edit distance / Levenshtein | O(n²) complexity. At 10,000 events/second, pairwise comparison is computationally infeasible. |
| Exact message match | Too strict. Two instances of `timeout after 30000ms` and `timeout after 28514ms` would create separate incidents. |
| Exception class only | Too loose. `NullPointerException` appears everywhere. Grouping by exception type alone would aggregate unrelated errors. |
### Consequences
Accepted trade-offs:
- Novel log formats with irregular structure may not template cleanly, producing noisy fingerprints
- The regex patterns for stripping dynamic values must be maintained as log formats evolve
- Edge cases: logs with no stack trace, logs in non-English languages, logs with custom dynamic patterns
Benefits realized:
- Fully deterministic - the same log always produces the same fingerprint, regardless of time or deployment
- No infrastructure required beyond a SHA-256 hash function
- Fingerprints are stable across deployments - an incident created today and a reoccurrence next week will share the same fingerprint and be correctly linked
- Zero false negatives for structurally identical errors (unlike fuzzy matching)
## ADR-004: Always-ACK DLQ Pattern

**Status:** Accepted
### Context
Kafka consumers in log0 use manual offset acknowledgment. When a consumer fails to process a message (normalization error, serialization failure, downstream Kafka publish failure), a decision is needed: what should happen to the unprocessed message?
The naive answer is to not ACK - let Kafka redeliver the message. But if the message itself is the problem (malformed payload, unexpected schema), it will fail on every redelivery, permanently stalling the partition.
### Decision
On any processing failure:
1. Wrap the original event in a `DlqEvent` (capturing `originalEvent`, `errorMessage`, `failedAt`, `failedAtTs`)
2. Publish the `DlqEvent` to `raw-logs-dlq`
3. Always ACK the original offset - even if the DLQ publish fails
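As a sketch, the envelope can be a simple record carrying the failed payload plus its failure context. Field names follow the decision above; the record shape and the `of` factory are illustrative, not the production class:

```java
import java.time.Instant;

// Illustrative DLQ envelope: preserves the original payload and failure
// context so the event can be inspected and replayed later.
public record DlqEvent(
        String originalEvent,   // original message payload, as received
        String errorMessage,    // exception message from the failed step
        String failedAt,        // service/stage where processing failed
        Instant failedAtTs) {   // when the failure occurred

    static DlqEvent of(String originalEvent, Exception cause, String stage) {
        return new DlqEvent(originalEvent, cause.getMessage(), stage, Instant.now());
    }
}
```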
### Alternatives Considered
| Alternative | Why Rejected |
|---|---|
| Retry without DLQ | A bad message is retried indefinitely. One malformed event from a misconfigured service stalls the entire partition for all tenants sharing that partition. |
| Discard on failure | Data loss. No ability to replay or investigate failures. |
| DLQ only if retries exhausted | Adds retry delay before DLQ capture. For a known-bad message, retries add latency without benefit. |
| Block partition on failure | Guarantees ordering but makes the system brittle. One bad actor (a service sending malformed logs) can halt the entire ingestion pipeline. |
### Consequences
Accepted trade-offs:
- Messages that fail DLQ publish are lost (the ACK has already been sent). This is an acknowledged edge case - a failure to write to the DLQ is treated as fatal and triggers an alert.
- The DLQ must be monitored. A growing DLQ is a silent signal that something is broken upstream.
Benefits realized:
- The ingestion and normalization pipelines are never stalled by individual bad messages
- Failed events are preserved with full context (`originalEvent`, error, timestamp, service) for post-mortem analysis
- Replay is possible: fix the bug, re-publish `originalEvent` to `raw-logs`, and the pipeline processes it correctly
## ADR-005: Manual Kafka Offset Acknowledgment

**Status:** Accepted
### Context
Kafka consumers can manage offset commits in two ways: automatically (Kafka commits periodically, regardless of processing state) or manually (the application commits only when it decides to).
### Decision
Use manual offset acknowledgment for all Kafka consumers in log0.
Configuration:

```yaml
spring:
  kafka:
    consumer:
      enable-auto-commit: false
    listener:
      ack-mode: manual
```

Consumer code:

```java
@KafkaListener(topics = "raw-logs")
public void consume(RawLogEvent event, Acknowledgment ack) {
    try {
        // process...
        producer.publish(normalized);
        ack.acknowledge(); // commit only on success
    } catch (Exception e) {
        dlqProducer.publish(dlqEvent);
        ack.acknowledge(); // commit after DLQ (ADR-004)
    }
}
```

### Alternatives Considered
| Alternative | Why Rejected |
|---|---|
| Auto-commit | Kafka commits the offset after a configurable interval, regardless of whether processing succeeded. If the service crashes after committing but before completing processing, messages are silently lost. At-most-once semantics - not acceptable for incident data. |
### Consequences
Accepted trade-offs:
- If the service crashes after processing but before ACKing, the message is reprocessed. Consumers must be idempotent.
- Slightly more complex consumer code
Benefits realized:
- At-least-once delivery guarantee - no message is ever silently lost
- Combined with the DLQ pattern, every message either succeeds or is captured for replay
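Since at-least-once delivery implies occasional redelivery, downstream handlers must tolerate duplicates. One common approach - sketched here under the assumption that each event carries a unique ID, not taken from the log0 codebase - is to track already-processed IDs:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of an idempotent handler: a redelivered event with the same ID
// is acknowledged but not reprocessed. A production version would bound
// or persist the seen-set (e.g. keyed store with a TTL).
public class IdempotentHandler {

    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    /** @return true if the event was processed, false if it was a duplicate */
    public boolean handle(String eventId, Runnable work) {
        if (!processed.add(eventId)) {
            return false; // duplicate delivery: skip side effects, still ACK
        }
        work.run();
        return true;
    }
}
```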
## ADR-006: Custom Kafka Serializers

**Status:** Accepted
### Context
Spring Kafka provides a `JsonSerializer` and `JsonDeserializer` that use Jackson to convert Java objects to/from bytes. They work automatically but rely on Spring's managed Jackson `ObjectMapper`, which introduces coupling to Spring's auto-configuration and Jackson version management.
### Decision
Write custom serializers and deserializers for every Kafka event type.
Each serializer is a straightforward wrapper (note that `writeValueAsBytes` can throw a checked exception, which `Serializer#serialize` does not declare, so it is rethrown unchecked):

```java
public class RawLogEventSerializer implements Serializer<RawLogEvent> {

    private final ObjectMapper objectMapper = new ObjectMapper()
            .registerModule(new JavaTimeModule());

    @Override
    public byte[] serialize(String topic, RawLogEvent data) {
        try {
            return objectMapper.writeValueAsBytes(data);
        } catch (Exception e) {
            // Serializer#serialize cannot throw checked exceptions
            throw new SerializationException("Failed to serialize RawLogEvent", e);
        }
    }
}
```

### Consequences
Accepted trade-offs:
- More classes to maintain (one serializer + one deserializer per event type)
- Jackson version must be explicitly managed per serializer
Benefits realized:
- Full control over serialization behavior - no Spring magic
- Custom serializers can be unit tested independently
- Jackson version can be chosen per serializer (tools.jackson 3.x for ingestion, com.fasterxml.jackson 2.x for normalization - intentional due to Spring Boot 4 transitional period)
- Schema evolution is explicit: changing a serializer is a deliberate code change, not an implicit configuration change
## ADR-007: tenantId as the Kafka Partition Key

**Status:** Accepted
### Context
Kafka assigns messages to partitions based on the message key. The partition a message lands on determines which consumer instance processes it (within a consumer group). The key design determines two things: ordering guarantees and load distribution.
### Decision
Use `tenantId` as the Kafka message key for all topics in the data pipeline (`raw-logs`, `normalized-logs`). Use `incidentId` as the key for `notification-events`.
### Reasoning
All messages with the same key land on the same partition, and Kafka guarantees ordering within a partition. Setting tenantId as the key means:
- All logs from the same tenant are processed in order. This is critical for the Clustering Service - if two logs from the same tenant are processed out of order, the clustering time window may be computed incorrectly.
- Tenant data is isolated to specific partitions. While this is not a security boundary (all consumers read all partitions), it simplifies reasoning about which consumer instance is handling which tenant's data.
`notification-events` uses `incidentId` as the key because multiple notifications about the same incident (created, assigned, resolved) must be processed in order by the Notification Service to correctly update the Slack message.
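The routing behavior can be illustrated with a simplified partitioner. Kafka's default partitioner actually uses murmur2 over the serialized key bytes; `String.hashCode()` below is a stand-in to show the principle that equal keys always map to the same partition:

```java
// Simplified sketch of key-based partitioning. Kafka's default partitioner
// uses murmur2 over the key bytes; String.hashCode() here is a stand-in.
public class PartitionSketch {

    static int partitionFor(String key, int numPartitions) {
        // mask the sign bit so the result is non-negative
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

Because the mapping depends only on the key, every event for a given `tenantId` lands on the same partition - which is exactly what provides the per-tenant ordering guarantee.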
### Consequences
Accepted trade-offs:
- Uneven partition distribution if tenant log volume is highly skewed (one large tenant, many small ones). Mitigated at scale by sub-tenant keying (e.g., `tenantId:serviceId`).
- A tenant cannot be distributed across multiple consumer instances within the same group - all their logs are on one partition. Horizontal scaling applies across tenants, not within a single tenant.
Benefits realized:
- Per-tenant ordering guarantee - essential for correct clustering
- Simple, predictable routing - given a tenantId, you know which partition its events are on
- Natural foundation for per-tenant rate limiting at the partition level
## ADR-008: Strategy Pattern for LLM Provider

**Status:** Accepted
### Context
The AI Summary Service must call an external LLM to generate incident summaries. The LLM landscape is fragmented: OpenAI, Azure OpenAI, Google Gemini, Anthropic, Groq, and self-hosted models all have different base URLs, authentication schemes, and request/response shapes. Hardcoding any single provider creates lock-in and makes development difficult (not everyone has a paid OpenAI key).
### Decision
Use the Strategy Pattern to abstract LLM provider selection behind a single interface.
```java
public interface LlmProvider {
    String generateSummary(String prompt);
}
```

Each provider implements this interface independently. The active implementation is selected at startup via `@ConditionalOnProperty`:

```yaml
# application.yml
ai:
  provider: groq   # swap to: openai | gemini | anthropic | azure
```

Default provider: Groq. Groq offers a free API tier with an OpenAI-compatible endpoint (`/openai/v1/chat/completions`), making `GroqProvider` nearly identical in code to `OpenAiProvider`. This means the system works out-of-the-box with a free API key during development.
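Stripped of the Spring wiring, the pattern reduces to a lookup from the configured provider name to an implementation. The sketch below uses stand-in lambdas for the providers; in the real service each provider is a Spring bean activated by `@ConditionalOnProperty`:

```java
import java.util.Map;

// Plain-Java sketch of the strategy selection. The lambda bodies are
// placeholders; real providers make an HTTP call to their LLM API.
public class LlmProviderSketch {

    interface LlmProvider {
        String generateSummary(String prompt);
    }

    static final Map<String, LlmProvider> PROVIDERS = Map.of(
            "groq",   prompt -> "groq:" + prompt,     // stand-in implementations
            "openai", prompt -> "openai:" + prompt);

    static LlmProvider select(String configured) {
        LlmProvider p = PROVIDERS.get(configured);
        if (p == null) {
            throw new IllegalArgumentException("unknown provider: " + configured);
        }
        return p;
    }
}
```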
### Alternatives Considered
| Alternative | Why Rejected |
|---|---|
| Hardcode OpenAI | Requires a paid API key. Breaks local dev for contributors without one. Any provider change requires code surgery across the service. |
| LangChain4j or Spring AI | Adds a framework abstraction layer on top of another abstraction. For a single use case (one prompt, one response), the overhead is not justified. Direct HTTP calls give full visibility and control. |
| Kafka event for summary request | Adds a new topic and consumer just to trigger an LLM call. The AI Service is already a dedicated service - a synchronous REST call from Incident Service is simpler and sufficient. |
### Consequences
Accepted trade-offs:
- Each `LlmProvider` implementation must handle its own HTTP client, error handling, and retry logic. Some duplication across providers.
- Adding a new provider requires a new class and a new `@ConditionalOnProperty` config block.
Benefits realized:
- Swap providers by changing one config line - no code changes, no redeployment of other services
- `GroqProvider` works with a free API key, unblocking local development
- Each provider is independently unit-testable
- Same pattern already used in the codebase: `OccurrenceStore` interface → `InMemoryOccurrenceStore` in the Clustering Service
## ADR-009: Stateless JWT Validation in Each Service

**Status:** Accepted
### Context
log0 has a dedicated auth-service (port 8086) that issues JWT access tokens on login. When a protected service (e.g. incident-service) receives a request carrying a JWT, it must verify the token is valid. There are two architectural options:
1. Remote validation - each service calls auth-service on every request: `GET /api/v1/auth/validate?token=...`
2. Local validation - each service validates the JWT locally using the shared `JWT_SECRET` signing key
### Decision
Use local (stateless) JWT validation in every protected service via a `JwtAuthFilter` (`OncePerRequestFilter`). Each service holds the `JWT_SECRET` as an environment variable and verifies the HS256 signature and `exp` claim directly - no network call to auth-service.
API key validation (ingestion path) remains remote - the Ingestion Gateway calls `POST /api/v1/auth/validate-key` on auth-service because API keys require a database lookup (hash comparison against the `API_KEY` table) that cannot be done statelessly.
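Local HS256 verification needs nothing beyond an HMAC over the token's first two segments. A self-contained sketch - the real services use a `JwtUtil`, and claim checks such as `exp` are omitted here - looks like this:

```java
import java.nio.charset.StandardCharsets;
import java.security.GeneralSecurityException;
import java.security.MessageDigest;
import java.util.Base64;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

// Sketch of stateless HS256 verification: recompute the signature over
// "header.payload" with the shared secret and compare. Claim validation
// (exp, etc.) would follow once the signature is confirmed.
public class JwtVerifySketch {

    // Issuer side of the contract: sign "header.payload" with the shared secret.
    static String signHs256(String headerDotPayload, byte[] secret) {
        try {
            Mac mac = Mac.getInstance("HmacSHA256");
            mac.init(new SecretKeySpec(secret, "HmacSHA256"));
            byte[] sig = mac.doFinal(headerDotPayload.getBytes(StandardCharsets.US_ASCII));
            return Base64.getUrlEncoder().withoutPadding().encodeToString(sig);
        } catch (GeneralSecurityException e) {
            throw new IllegalStateException(e); // HmacSHA256 is always available
        }
    }

    // Verifier side: recompute and compare in constant time.
    static boolean verifyHs256(String token, byte[] secret) {
        String[] parts = token.split("\\.");
        if (parts.length != 3) return false;
        String expected = signHs256(parts[0] + "." + parts[1], secret);
        return MessageDigest.isEqual(
                expected.getBytes(StandardCharsets.US_ASCII),
                parts[2].getBytes(StandardCharsets.US_ASCII));
    }
}
```

Because verification is pure computation over the token and the shared secret, no call to auth-service is needed on the request path - which is the whole point of this ADR.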
### Alternatives Considered
| Alternative | Why Rejected |
|---|---|
| Remote JWT validation on every request | Adds network latency on every authenticated API call. Makes auth-service a synchronous dependency - if it is slow or down, all API calls fail. Unnecessary given JWT is cryptographically self-verifiable. |
| API Gateway handling all auth | Would require introducing Spring Cloud Gateway or a similar proxy. Adds infrastructure complexity that is not justified for the current number of services. Can be added later without changing the JWT design. |
| Opaque tokens (session IDs) | Would require a stateful session store (Redis) shared across services. Adds infrastructure dependency and per-request DB lookup. JWT's stateless verifiability is a key advantage for a distributed system. |
### Consequences
Accepted trade-offs:
- `JWT_SECRET` must be shared across all services that validate tokens. Secret rotation requires redeploying all protected services simultaneously.
- Issued tokens cannot be revoked before expiry (the short 1-hour TTL mitigates this). Logout only revokes the refresh token - the access token remains valid until it expires.
- Each protected service must include `JwtAuthFilter` and `JwtUtil` - small duplication, mitigated by a shared library (Phase 0.4).
Benefits realized:
- Zero latency overhead on the hot API path - no auth-service call per request
- auth-service is not a runtime dependency of incident-service for reads
- Horizontally scalable - each service instance validates independently with no shared state