
Infrastructure Monitoring & Alerting

GitGazer includes Terraform-managed infrastructure monitoring built on Amazon CloudWatch and SNS.

Overview

Monitoring resources are defined in infra/monitoring_alarms.tf:

  • CloudWatch metric alarms for API, Lambda, SQS, RDS/Aurora, CloudFront, and EventBridge
  • CloudWatch Logs metric filters for Lambda ERROR-level logs
  • SNS topic for alert notifications

Global Toggle

All monitoring resources are controlled by one Terraform variable:

  • enable_cloudwatch_alarm_notifications
  • Default: true

When this variable is false, Terraform creates none of the following:

  • the SNS alarm topic
  • the CloudWatch metric alarms
  • the CloudWatch Logs metric filters used for Lambda error detection
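
A minimal sketch of how such a toggle is typically wired, assuming the common count-gating pattern (the variable name is from this doc; the topic name and the gating pattern itself are illustrative, not necessarily GitGazer's exact implementation):

variable "enable_cloudwatch_alarm_notifications" {
  description = "Create the SNS alert topic, CloudWatch alarms, and log metric filters."
  type        = bool
  default     = true
}

# Each monitoring resource is gated on the toggle via count (assumed pattern).
resource "aws_sns_topic" "cloudwatch_alarms" {
  count = var.enable_cloudwatch_alarm_notifications ? 1 : 0
  name  = "gitgazer-cloudwatch-alarms" # topic name assumed
}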

Set it in infra/terraform.tfvars:

enable_cloudwatch_alarm_notifications = false

Apply the change:

cd infra
aws-vault exec <profile> -- terraform apply

Alarm Delivery

When monitoring is enabled, alarms publish to an SNS topic:

  • Resource: aws_sns_topic.cloudwatch_alarms
  • Output: cloudwatch_alarms_topic_arn
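
Assuming the count-gated topic sketched earlier, the output can use one() so it returns null when monitoring is disabled (a sketch, not necessarily the repo's exact code):

output "cloudwatch_alarms_topic_arn" {
  description = "ARN of the alarm SNS topic, or null when monitoring is disabled."
  value       = one(aws_sns_topic.cloudwatch_alarms[*].arn)
}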

Retrieve the topic ARN with:

cd infra
aws-vault exec <profile> -- terraform output cloudwatch_alarms_topic_arn

The topic has no subscribers by default; add one or more SNS subscriptions (email, webhook bridge, incident tooling) to receive notifications.
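
For example, a hypothetical email subscription managed in the same stack (resource name and address are placeholders; SNS email endpoints must confirm the subscription before they receive anything):

resource "aws_sns_topic_subscription" "oncall_email" {
  count     = var.enable_cloudwatch_alarm_notifications ? 1 : 0
  topic_arn = aws_sns_topic.cloudwatch_alarms[0].arn
  protocol  = "email"
  endpoint  = "oncall@example.com" # placeholder address
}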

Alarm Coverage

Lambda

  • Invocation errors (AWS/Lambda Errors)
  • Throttles (AWS/Lambda Throttles)
  • Duration p95 near timeout (AWS/Lambda Duration)
  • Log-based error detection via CloudWatch Logs metric filters matching ERROR-level fields in structured JSON logs (sketched below)
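
The log-based detection presumably pairs a metric filter with an alarm on the resulting custom metric. A sketch of the filter half, with an illustrative log group, metric name, and namespace (count gating per the global toggle is omitted for brevity):

resource "aws_cloudwatch_log_metric_filter" "lambda_error_logs" {
  name           = "lambda-error-logs"        # filter name assumed
  log_group_name = "/aws/lambda/gitgazer-api" # one filter per monitored function
  pattern        = "{ $.level = \"ERROR\" }"  # matches ERROR-level structured JSON logs

  metric_transformation {
    name      = "ErrorLogCount"   # custom metric name assumed
    namespace = "GitGazer/Lambda" # custom namespace assumed
    value     = "1"               # emit 1 per matching log event
  }
}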

API Gateway

  • HTTP API 5xx
  • HTTP API p95 latency
  • WebSocket API 5xx
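
The p95 latency alarm needs extended_statistic rather than a plain statistic. A sketch against the HTTP API metrics in AWS/ApiGateway (ApiId, threshold, and alarm name are placeholders):

resource "aws_cloudwatch_metric_alarm" "http_api_latency_p95" {
  alarm_name          = "gitgazer-http-api-latency-p95" # name assumed
  namespace           = "AWS/ApiGateway"
  metric_name         = "Latency"
  dimensions          = { ApiId = "a1b2c3d4" } # placeholder HTTP API id
  extended_statistic  = "p95"
  period              = 300
  evaluation_periods  = 3
  threshold           = 2000 # milliseconds (placeholder)
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  # alarm_actions omitted here; the real config publishes to the alert topic
}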

SQS

  • Main queue visible backlog
  • Oldest message age
  • DLQ depth
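
Backlog-age alarms read ApproximateAgeOfOldestMessage from AWS/SQS with the Maximum statistic. A sketch (queue name and threshold are placeholders):

resource "aws_cloudwatch_metric_alarm" "queue_oldest_message_age" {
  alarm_name          = "gitgazer-queue-oldest-message-age" # name assumed
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateAgeOfOldestMessage"
  dimensions          = { QueueName = "gitgazer-main" } # placeholder
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 2
  threshold           = 900 # seconds (placeholder)
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
}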

RDS / Aurora

  • Database connections high
  • ACU utilization high (serverless classes)
  • Deadlocks
  • Free local storage low (per Aurora instance)
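
For Aurora Serverless v2, ACU utilization surfaces as the ACUUtilization metric in AWS/RDS, a percentage reported per instance. A sketch (instance identifier and threshold are placeholders):

resource "aws_cloudwatch_metric_alarm" "aurora_acu_utilization" {
  alarm_name          = "gitgazer-aurora-acu-utilization" # name assumed
  namespace           = "AWS/RDS"
  metric_name         = "ACUUtilization"
  dimensions          = { DBInstanceIdentifier = "gitgazer-aurora-1" } # placeholder
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  threshold           = 90 # percent (placeholder)
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
}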

CloudFront

  • UI distribution 5xx error rate
  • Docs distribution 5xx error rate (only when docs are enabled)
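
CloudFront publishes 5xxErrorRate (a percentage) only to us-east-1, with a fixed Region = "Global" dimension, so these alarms need a us-east-1 provider. A sketch (provider alias, distribution id, and threshold are placeholders):

resource "aws_cloudwatch_metric_alarm" "ui_5xx_error_rate" {
  provider    = aws.us_east_1 # CloudFront metrics live in us-east-1
  alarm_name  = "gitgazer-ui-5xx-error-rate" # name assumed
  namespace   = "AWS/CloudFront"
  metric_name = "5xxErrorRate"
  dimensions = {
    DistributionId = "E1ABCDEF" # placeholder
    Region         = "Global"
  }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 5 # percent (placeholder)
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
}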

EventBridge

  • Org sync schedule failed invocations
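
Failed scheduled invocations appear as FailedInvocations in AWS/Events, dimensioned by rule name. A sketch (rule name is a placeholder):

resource "aws_cloudwatch_metric_alarm" "org_sync_failed_invocations" {
  alarm_name          = "gitgazer-org-sync-failed-invocations" # name assumed
  namespace           = "AWS/Events"
  metric_name         = "FailedInvocations"
  dimensions          = { RuleName = "gitgazer-org-sync" } # placeholder
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"
}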

Noise Controls

Current defaults are tuned to reduce alert fatigue:

  • treat_missing_data = "notBreaching"
  • insufficient_data_actions = []
  • ok_actions only on critical alarms (for example Lambda invocation errors, DLQ depth, EventBridge failed invocations)
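
Applied to a critical alarm such as DLQ depth, those defaults look roughly like this (queue name is a placeholder; the count-gated topic reference follows the earlier sketch):

resource "aws_cloudwatch_metric_alarm" "dlq_depth" {
  alarm_name          = "gitgazer-dlq-depth" # name assumed
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  dimensions          = { QueueName = "gitgazer-dlq" } # placeholder
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"

  # Noise controls from the list above:
  treat_missing_data        = "notBreaching"
  insufficient_data_actions = []
  alarm_actions             = [aws_sns_topic.cloudwatch_alarms[0].arn]
  ok_actions                = [aws_sns_topic.cloudwatch_alarms[0].arn] # critical alarms only
}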

Cost Notes

The main incremental cost drivers are:

  • CloudWatch custom metrics generated by log metric filters
  • CloudWatch alarms

Log metric filters themselves add little operational overhead, but the custom-metric count scales with the number of monitored Lambda log groups.
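
As a rough illustration only (verify against current CloudWatch pricing): at on-demand rates of about $0.30 per custom metric and $0.10 per standard-resolution alarm per month, 10 monitored Lambda log groups emitting one error metric each plus 25 alarms would add roughly $3.00 + $2.50 = $5.50 per month.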