
Service Internals

Class diagrams, data models, and Kafka event schemas for every log0 service. Use this as reference documentation for implementation: it covers the ingestion gateway, normalization service, incident service, notification service, all Kafka events, the PostgreSQL data model, and the ClickHouse log schema.

What This Section Covers

This section documents the internal structure of each service - the classes, their relationships, and the design decisions behind them. Use this as a reference when reading, writing, or reviewing implementation code.

Implementation status as of March 2026:

| Service | Status | Notes |
| --- | --- | --- |
| Ingestion Gateway | ✅ ~85% complete | REST API, Kafka producer, DLQ all working |
| Normalization Service | ✅ ~80% complete | Normalization, fingerprinting, and producer reliability all complete |
| Clustering Service | ✅ ~95% complete | Core pipeline implemented; Redis persistence and sliding window upgrade planned |
| Incident Service | ✅ ~80% complete | Kafka consumer, state machine, REST API, PostgreSQL (Flyway), DLQ, notifications all done. Filters and ClickHouse log lookup pending. |
| Notification Service | ✅ ~70% complete | Kafka consumer, Slack Bot API, Block Kit alerts, DLQ done. Slack interactions, per-tenant token, WhatsApp pending. |
| Auth Service | ✅ ~90% complete | Tenant, User, ApiKey, RefreshToken entities; JwtUtil, AuthService, TenantService, UserService, ApiKeyService; JwtAuthFilter, SecurityConfig; all controllers. Frontend integration pending. |
| AI Summary Service | ✅ ~80% complete | Strategy pattern (Groq/OpenAI/Gemini/Anthropic), PromptBuilder, async callback. Redis cache and fallback template pending. |

Ingestion Gateway

Responsibility: Accept log events from client services over HTTP, validate them, and publish them to Kafka. Return immediately - the gateway's job ends at 202 Accepted. It has no downstream dependencies except Kafka and Redis (rate limiting, planned).

Key design decisions:

  • HTTP entry point is synchronous; Kafka publish is asynchronous with a completion callback
  • On Kafka failure, the DLQ is published from within the async callback - the HTTP response is already 202 before the failure is known
  • All tenant context comes from HTTP headers, never from the request body. The body is the log event only.
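The publish-then-respond sequence can be sketched as below. This is an illustrative sketch only, with the Kafka producer stubbed by a CompletableFuture; the class and method names (IngestSketch, publish, ingest) are assumptions, not the gateway's actual code.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicReference;

// Illustrative sketch: the HTTP handler returns 202 immediately, and a
// failed publish is diverted to the DLQ from inside the async callback.
public class IngestSketch {

    static final AtomicReference<String> dlq = new AtomicReference<>();

    // Stand-in for kafkaTemplate.send(...): completes later, possibly exceptionally.
    static CompletableFuture<Void> publish(String event, boolean fail) {
        return fail
                ? CompletableFuture.failedFuture(new RuntimeException("broker down"))
                : CompletableFuture.completedFuture(null);
    }

    // Returns the HTTP status before the publish outcome is known.
    static int ingest(String event, boolean brokerFails) {
        publish(event, brokerFails).whenComplete((ok, err) -> {
            if (err != null) {
                dlq.set(event); // DLQ publish happens here, after 202 was already sent
            }
        });
        return 202; // always accepted
    }

    public static void main(String[] args) {
        System.out.println(ingest("evt-1", true)); // 202 even though publish fails
        System.out.println(dlq.get());             // evt-1 landed on the (stubbed) DLQ
    }
}
```

The key property: the HTTP response status never depends on the publish outcome, so a broker outage degrades to DLQ traffic rather than client-facing errors.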

Note on the two Jackson versions: RawLogEventSerializer uses tools.jackson (Jackson 3.x, co-packaged with Spring Boot 4.0.4), while DlqEventSerializer uses com.fasterxml.jackson (Jackson 2.x). This is intentional: it gives full control over each serializer's behavior without relying on Spring's auto-configuration to manage both.


Normalization Service

Responsibility: Consume RawLogEvents from Kafka, transform them into a canonical schema, generate a deterministic fingerprint, and publish NormalizedLogEvents to the next topic. Also writes every normalized event to ClickHouse for historical storage.

Key design decisions:

  • Manual Kafka acknowledgment - offset committed only after successful publish or DLQ fallback. The partition never blocks.
  • FingerprintGenerator is designed as a separate @Component so it can be unit-tested independently of the normalization pipeline.
  • schemaVersion: "v1" field on every event enables future schema evolution without breaking existing consumers.
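The determinism requirement can be illustrated with a small sketch: identical normalized inputs must always produce the same hash. The field choice below (service, level, message template) is an assumption for illustration, not the service's actual FingerprintGenerator.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

// Sketch of a deterministic fingerprint: hash a canonical, ordered
// concatenation of normalized fields. Field selection is illustrative.
public class FingerprintSketch {

    static String fingerprint(String service, String level, String messageTemplate) {
        try {
            MessageDigest sha = MessageDigest.getInstance("SHA-256");
            String canonical = service + "|" + level + "|" + messageTemplate;
            return HexFormat.of().formatHex(
                    sha.digest(canonical.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String a = fingerprint("checkout", "ERROR", "Timeout calling {service}");
        String b = fingerprint("checkout", "ERROR", "Timeout calling {service}");
        System.out.println(a.equals(b)); // deterministic: true
    }
}
```

Keeping this in its own @Component means the hash contract can be pinned with unit tests independent of the Kafka pipeline.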

Clustering Service

Responsibility: Consume NormalizedLogEvents from Kafka, group them by fingerprint within a time window, and publish an IncidentEvent to incident-events when the occurrence threshold is crossed. Continues publishing updates on every subsequent match so the Incident Service can track occurrenceCount and lastSeenAt in real time.

Key design decisions:

  • OccurrenceStore is an interface - FingerprintClusterer never touches storage directly. Swapping in-memory for Redis later requires changing one class, not the clusterer.
  • Tumbling window (fixed 5-min buckets) chosen over sliding window for MVP simplicity - a burst of errors will cross the threshold regardless of window style.
  • Threshold and window duration are externalized into ClusteringConfig so they can be tuned without a code change.
  • topMessages collected per window - capped at 10 distinct values. Used later by the AI Summary Service to build its prompt.

The OccurrenceStore interface is the upgrade seam. When you need persistence or horizontal scaling, implement RedisOccurrenceStore and swap it in via Spring @Primary. FingerprintClusterer never changes.

The ClusterKey is what makes the tumbling window work. Two events in the same 5-minute bucket produce the same key → same counter. An event in the next bucket gets a fresh counter. The bucket is computed as timestamp.truncatedTo(ChronoUnit.MINUTES) rounded down to the nearest 5 - e.g. 10:03 → 10:00, 10:07 → 10:05.
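The bucket computation above can be sketched as follows; the class name is illustrative, but the truncate-then-round-down logic matches the description:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Sketch of the tumbling-window bucket: truncate to the minute, then
// round down to the nearest 5-minute boundary.
public class BucketSketch {

    static Instant bucket(Instant ts) {
        Instant minute = ts.truncatedTo(ChronoUnit.MINUTES);
        long epochMin = minute.getEpochSecond() / 60;
        return Instant.ofEpochSecond((epochMin - (epochMin % 5)) * 60);
    }

    public static void main(String[] args) {
        System.out.println(bucket(Instant.parse("2026-03-01T10:03:42Z"))); // 2026-03-01T10:00:00Z
        System.out.println(bucket(Instant.parse("2026-03-01T10:07:09Z"))); // 2026-03-01T10:05:00Z
    }
}
```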

Sliding window upgrade path (future): When accuracy across bucket boundaries matters, the upgrade involves four steps: (1) extract a WindowStrategy interface with computeKey() and isExpired() methods; (2) move the current bucket truncation logic into TumblingWindowStrategy; (3) implement SlidingWindowStrategy which stores a List<Instant> of per-event timestamps in OccurrenceWindow and evicts expired entries on each increment() call; (4) inject WindowStrategy into OccurrenceStore implementations. FingerprintClusterer requires no changes because it only talks to OccurrenceStore.
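Step (3) of the upgrade path can be sketched in miniature. The class and method names follow the plan above, but the exact signatures are assumptions:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

// Sketch of the planned SlidingWindowStrategy behavior: keep per-event
// timestamps and evict anything older than the window on each increment.
public class SlidingWindowSketch {

    static final Duration WINDOW = Duration.ofMinutes(5);
    final List<Instant> timestamps = new ArrayList<>();

    // Record one occurrence, evict expired entries, return the live count.
    int increment(Instant now) {
        timestamps.removeIf(t -> t.isBefore(now.minus(WINDOW)));
        timestamps.add(now);
        return timestamps.size();
    }

    public static void main(String[] args) {
        SlidingWindowSketch w = new SlidingWindowSketch();
        Instant t0 = Instant.parse("2026-03-01T10:00:00Z");
        w.increment(t0);
        w.increment(t0.plusSeconds(60));
        // Six minutes later the first event has fallen out of the window.
        System.out.println(w.increment(t0.plusSeconds(360))); // 2
    }
}
```

Unlike the tumbling bucket, two events 30 seconds apart always count toward the same window here, even when they straddle a 5-minute boundary.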


Incident Service

Responsibility: Consume IncidentEvents, maintain the incident lifecycle in PostgreSQL, enforce valid state transitions, trigger AI summary generation (Phase 6), and publish NotificationEvents. Also exposes a REST API for the web frontend and Slack interaction handler.
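Enforcing valid state transitions typically comes down to a transition table. The sketch below is hypothetical: only NEW and RESOLVED are confirmed by this document, and the intermediate states and allowed edges are illustrative assumptions.

```java
import java.util.Map;
import java.util.Set;

// Illustrative incident lifecycle state machine. States other than
// NEW and RESOLVED, and the transition edges, are assumptions.
public class IncidentStateSketch {

    enum Status { NEW, ACKNOWLEDGED, IN_PROGRESS, RESOLVED }

    static final Map<Status, Set<Status>> ALLOWED = Map.of(
            Status.NEW, Set.of(Status.ACKNOWLEDGED, Status.RESOLVED),
            Status.ACKNOWLEDGED, Set.of(Status.IN_PROGRESS, Status.RESOLVED),
            Status.IN_PROGRESS, Set.of(Status.RESOLVED),
            Status.RESOLVED, Set.of() // closed; deliberate re-opening is a later phase
    );

    static boolean canTransition(Status from, Status to) {
        return ALLOWED.getOrDefault(from, Set.of()).contains(to);
    }

    public static void main(String[] args) {
        System.out.println(canTransition(Status.NEW, Status.ACKNOWLEDGED)); // true
        System.out.println(canTransition(Status.RESOLVED, Status.NEW));     // false
    }
}
```

A rejected transition would surface as a 409 from the REST API and also feed the INCIDENT_STATE_HISTORY audit trail on success.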


Notification Service

Responsibility: Consume NotificationEvents and deliver them to Slack and WhatsApp. Handle Slack interaction callbacks (button clicks, modal submissions) and route them back to the Incident Service.

The Notification Service is the only service with an inbound HTTP endpoint that is not triggered by a client. Slack calls its interaction webhook when users interact with buttons or submit modals.

Key design decisions:

  • Uses the Slack Bot API (chat.postMessage) - not Incoming Webhooks. Bot API supports message updates, DMs, and interactive callbacks; Incoming Webhooks are one-way only.
  • Bot token stored in application.yml for MVP. Full per-tenant token stored in SLACK_INTEGRATION table (Phase 7.2).
  • Messages use Block Kit (structured cards with action buttons). Buttons are rendered in MVP but interaction handling is wired in Phase 7.4.
  • Same always-ACK DLQ pattern as all other services - Slack API failure never blocks the Kafka partition.
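A minimal chat.postMessage Block Kit payload looks roughly like the sketch below: a section block for the alert text plus an actions block carrying a button. The action_id, channel, and text values are illustrative, not the service's actual message layout.

```java
// Minimal Block Kit payload sketch for Slack's chat.postMessage:
// one section block and one actions block with an Acknowledge button.
public class BlockKitSketch {

    static String alertPayload(String channel, String title, String incidentId) {
        return """
                {
                  "channel": "%s",
                  "blocks": [
                    { "type": "section",
                      "text": { "type": "mrkdwn", "text": "*%s*" } },
                    { "type": "actions",
                      "elements": [
                        { "type": "button",
                          "text": { "type": "plain_text", "text": "Acknowledge" },
                          "action_id": "ack_incident",
                          "value": "%s" }
                      ] }
                  ]
                }""".formatted(channel, title, incidentId);
    }

    public static void main(String[] args) {
        System.out.println(alertPayload("#alerts", "Payment timeouts spiking", "inc-42"));
    }
}
```

When the button is clicked, Slack POSTs the action_id and value back to the interaction webhook, which is what Phase 7.4 wires up.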

AI Summary Service

Responsibility: Receive a summary request from the Incident Service, build a structured LLM prompt from the incident context, call the configured LLM provider, and write the result back to the Incident Service via REST. Runs as a standalone service on port 8085 so it can fail, retry, and scale independently of the incident pipeline.

Key design decisions:

  • Strategy Pattern for LLM providers - LlmProvider interface with interchangeable implementations (Groq, OpenAI, Gemini, Anthropic). Active provider selected by ai.provider in application.yml. See ADR-008.
  • Callback pattern - the Incident Service fires POST /api/v1/summaries via @Async (non-blocking), then the AI Service writes the result back via PATCH /api/v1/incidents/{id}/ai-summary. No Kafka topic required.
  • Prompt is self-contained - SummaryRequest carries everything needed (topMessages, severities, time range). No ClickHouse query for MVP.
  • Graceful degradation - if the LLM call times out or the provider is down, the incident still exists and is fully functional. ai_summary stays null and the Slack message renders without it.
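The Strategy Pattern described above reduces to one interface and a lookup keyed by the configured provider name. This sketch substitutes a plain map for Spring's bean selection, and the provider outputs are dummies:

```java
import java.util.Map;

// Strategy-pattern sketch: one LlmProvider interface, interchangeable
// implementations, active one chosen by a config value. In the real
// service Spring injects the bean; the registry map is a stand-in.
public class LlmStrategySketch {

    interface LlmProvider {
        String summarize(String prompt);
    }

    static class GroqProvider implements LlmProvider {
        public String summarize(String prompt) { return "[groq] " + prompt; }
    }

    static class OpenAiProvider implements LlmProvider {
        public String summarize(String prompt) { return "[openai] " + prompt; }
    }

    static final Map<String, LlmProvider> PROVIDERS = Map.of(
            "groq", new GroqProvider(),
            "openai", new OpenAiProvider());

    public static void main(String[] args) {
        String configured = "groq"; // would come from ai.provider in application.yml
        System.out.println(PROVIDERS.get(configured).summarize("3 timeouts in checkout"));
    }
}
```

Adding a new provider means one new class implementing LlmProvider and one config value; nothing else in the service changes.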

Auth Service

Responsibility: Manage tenants, users, and API keys. Issue and validate JWT access tokens. Expose an internal API key validation endpoint for the Ingestion Gateway. Runs on port 8086.

Key design decisions:

  • Stateless JWT validation - each protected service (e.g. incident-service) validates JWT locally using a shared JWT_SECRET. No network call to auth-service on every request. See ADR-009.
  • API key validation is remote - the Ingestion Gateway calls POST /api/v1/auth/validate-key on auth-service. Only ingestion uses API keys; all other protected services use JWT.
  • Raw API key shown once - only the SHA-256 hash is stored in the API_KEY table. The raw key is returned once at generation time and never retrievable again.
  • Refresh token rotation - refresh tokens are stored as SHA-256 hashes, revoked on logout, and rotated on use (each refresh issues a new refresh token and invalidates the previous one).
  • Phase 1 is email/password only - Google OAuth, GitHub OAuth, and Slack OAuth are planned post-Phase 1 additions (see roadmap Phase 1.3.x).
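The show-once API key pattern can be sketched as below. The "log0_" prefix and key length are assumptions for illustration, not the auth-service's actual key format:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.SecureRandom;
import java.util.Base64;
import java.util.HexFormat;

// Sketch: generate a random key, persist only its SHA-256 hash, and
// validate later by re-hashing. The raw key is shown exactly once.
public class ApiKeySketch {

    static String sha256Hex(String s) {
        try {
            return HexFormat.of().formatHex(
                    MessageDigest.getInstance("SHA-256")
                            .digest(s.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] random = new byte[32];
        new SecureRandom().nextBytes(random);
        String rawKey = "log0_" + Base64.getUrlEncoder().withoutPadding().encodeToString(random);

        String storedHash = sha256Hex(rawKey); // only this goes into API_KEY
        // ... rawKey is returned to the caller once, then discarded ...

        // Later, when the Ingestion Gateway calls validate-key:
        System.out.println(sha256Hex(rawKey).equals(storedHash)); // true
    }
}
```

Because only the hash is stored, a database leak exposes nothing directly usable as a credential; the same reasoning applies to the hashed refresh tokens.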

Kafka Event Schemas

Every message flowing through the Kafka pipeline is a typed event. This diagram shows the transformation pipeline and the inheritance/evolution relationships between event types.

Schema evolution: The schemaVersion field on NormalizedLogEvent is intentional. When the normalization schema needs to change (new fields, changed types), consumers can branch on schemaVersion to handle both old and new formats during a rolling deployment.
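Consumer-side branching on schemaVersion can be sketched like this; the "v2" branch and the event fields are hypothetical, since only v1 exists today:

```java
// Sketch of a consumer branching on schemaVersion during a rolling
// deployment. The v2 case is a hypothetical future format.
public class SchemaVersionSketch {

    record NormalizedLogEvent(String schemaVersion, String message) {}

    static String handle(NormalizedLogEvent event) {
        return switch (event.schemaVersion()) {
            case "v1" -> "handled-v1:" + event.message();
            case "v2" -> "handled-v2:" + event.message(); // hypothetical new format
            default -> throw new IllegalArgumentException(
                    "unknown schemaVersion " + event.schemaVersion());
        };
    }

    public static void main(String[] args) {
        System.out.println(handle(new NormalizedLogEvent("v1", "disk full")));
    }
}
```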


PostgreSQL Data Model

This is the full relational schema for the incident management database. Every table is scoped to a tenant_id - there are no cross-tenant joins.

Design notes:

  • INCIDENT.fingerprint is not a foreign key - it's a string hash. The fingerprint is the primary grouping mechanism in the Clustering Service, not a relational concept. Keeping it as a string makes querying flexible and avoids coupling the incident model to the normalization schema.

  • INCIDENT.severity is a varchar not an enum column - matches SeverityResolver output: HIGH, MEDIUM, LOW. CRITICAL is not used in MVP - SeverityResolver has no path that produces it. If added in future, it is a non-breaking schema change.

  • INCIDENT.top_messages is a jsonb array - carries the top 10 distinct error messages collected by the Clustering Service. Stored now so the AI Summary Service (Phase 6) has prompt material without needing to re-read Kafka events.

  • INCIDENT.ai_summary is null until Phase 6 - field exists in schema, no value written by MVP code. AISummaryService is not implemented in Phase 5.

  • INCIDENT_ASSIGNMENT is an append-only table - each reassignment creates a new row. This preserves the full assignment history. The "current assignee" is the row with the most recent assigned_at. assigned_by_user_id is accepted from the request body without DB validation until Phase 1 (Auth) is implemented.

  • INCIDENT_STATE_HISTORY is an audit table - it captures every status transition with the actor and timestamp. This supports SLA tracking, post-incident review, and compliance reporting.

  • Re-open behavior - resolved incidents are considered closed. When a fingerprint reoccurs, createOrUpdateIncident looks for an open incident (status != RESOLVED). If none exists, a new incident is created from NEW. Deliberate re-opening (RESOLVED → NEW transition) is a Phase 5.x feature.

  • USAGE_METRICS is a daily rollup - it tracks billing-relevant metrics (logs ingested, incidents created, notifications sent) aggregated per tenant per day. This is the SaaS billing foundation.
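The re-open behavior described above can be sketched with an in-memory stand-in for the INCIDENT table; the Incident fields shown are a simplification of the real entity:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Sketch of createOrUpdateIncident: a recurring fingerprint updates the
// open incident if one exists, otherwise creates a fresh one from NEW.
public class ReopenSketch {

    static class Incident {
        final String fingerprint;
        String status = "NEW";
        int occurrenceCount = 1;
        Incident(String fp) { this.fingerprint = fp; }
    }

    static final List<Incident> incidents = new ArrayList<>(); // stand-in for INCIDENT

    static Incident createOrUpdateIncident(String fingerprint) {
        Optional<Incident> open = incidents.stream()
                .filter(i -> i.fingerprint.equals(fingerprint) && !i.status.equals("RESOLVED"))
                .findFirst();
        if (open.isPresent()) {
            open.get().occurrenceCount++;
            return open.get();
        }
        Incident fresh = new Incident(fingerprint); // resolved ones stay closed
        incidents.add(fresh);
        return fresh;
    }

    public static void main(String[] args) {
        Incident a = createOrUpdateIncident("abc123");
        createOrUpdateIncident("abc123");              // same open incident, count bumps
        a.status = "RESOLVED";
        Incident b = createOrUpdateIncident("abc123"); // resolved, so a brand-new incident
        System.out.println(a == b);           // false
        System.out.println(incidents.size()); // 2
    }
}
```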


ClickHouse Log Events Schema

ClickHouse stores every normalized log event. It is optimized for analytical queries - specifically the GROUP BY fingerprint query used to build AI summary prompts, and time-range queries for the log explorer UI.

Why ClickHouse instead of PostgreSQL for logs?

PostgreSQL is optimized for transactional workloads - row-level inserts, point lookups, relational joins. ClickHouse is a columnar database optimized for aggregation queries over large volumes of data.

The AI summary query (GROUP BY message ORDER BY COUNT(*) DESC) runs over potentially millions of rows per fingerprint. On PostgreSQL, this is a full table scan. On ClickHouse, it reads only the fingerprint, message, and tenant_id columns - a fraction of the data - and executes in milliseconds.

Schema notes:

  • DateTime64 - nanosecond precision timestamp. Log correlation across distributed services requires sub-millisecond timestamp accuracy.
  • attributes: Map(String, String) - a flexible key-value store for structured log fields that don't fit the fixed schema (request ID, user ID, correlation ID, etc.). ClickHouse's native Map type enables filtering on arbitrary attributes without schema migrations.
  • message_template and fingerprint are stored alongside the raw message - this enables queries like "show me all distinct messages for this incident" (raw) as well as "how many incidents share this pattern" (fingerprint).
  • ingested_at is the time the event entered ClickHouse, separate from timestamp (the time the log was emitted). The difference between these two values is the pipeline latency - a useful operational metric.
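The latency metric mentioned in the last note is just the difference between the two timestamps; a trivial sketch, with illustrative values:

```java
import java.time.Duration;
import java.time.Instant;

// Pipeline latency sketch: ingested_at minus the emit timestamp.
public class LatencySketch {

    static Duration pipelineLatency(Instant emittedAt, Instant ingestedAt) {
        return Duration.between(emittedAt, ingestedAt);
    }

    public static void main(String[] args) {
        Instant emitted = Instant.parse("2026-03-01T10:00:00.000Z");
        Instant ingested = Instant.parse("2026-03-01T10:00:01.250Z");
        System.out.println(pipelineLatency(emitted, ingested).toMillis()); // 1250
    }
}
```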
