# Infrastructure Monitoring & Alerting

GitGazer includes Terraform-managed infrastructure monitoring built on Amazon CloudWatch and SNS.
## Overview

Monitoring resources are defined in `infra/monitoring_alarms.tf`:
- CloudWatch metric alarms for API, Lambda, SQS, RDS/Aurora, CloudFront, and EventBridge
- CloudWatch Logs metric filters for Lambda ERROR-level logs
- SNS topic for alert notifications
## Global Toggle

All monitoring resources are controlled by one Terraform variable:

- `enable_cloudwatch_alarm_notifications` (default: `true`)
When this variable is `false`, Terraform creates:
- no SNS alarm topic
- no CloudWatch metric alarms
- no CloudWatch log metric filters for Lambda log error detection
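The usual Terraform pattern for this kind of global toggle is a `count` expression on each resource. A minimal sketch, assuming a `bool` variable and using the topic name documented below (the attribute values are illustrative, not GitGazer's actual definitions):

```hcl
variable "enable_cloudwatch_alarm_notifications" {
  description = "Create CloudWatch alarms, log metric filters, and the SNS alert topic."
  type        = bool
  default     = true
}

# Every monitoring resource gates itself on the same flag:
# count = 1 creates the resource, count = 0 removes it on the next apply.
resource "aws_sns_topic" "cloudwatch_alarms" {
  count = var.enable_cloudwatch_alarm_notifications ? 1 : 0
  name  = "cloudwatch-alarms"
}
```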
Set it in `infra/terraform.tfvars`:

```hcl
enable_cloudwatch_alarm_notifications = false
```
Apply the change:

```shell
cd infra
aws-vault exec <profile> -- terraform apply
```
## Alarm Delivery

When monitoring is enabled, alarms publish to an SNS topic:

- Resource: `aws_sns_topic.cloudwatch_alarms`
- Output: `cloudwatch_alarms_topic_arn`
You can retrieve it via:

```shell
cd infra
aws-vault exec <profile> -- terraform output cloudwatch_alarms_topic_arn
```
Add one or more SNS subscriptions (email, webhook bridge, incident tooling) to actually receive notifications.
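An email subscription can be managed in Terraform alongside the topic. A hedged sketch — the resource label, the `[0]` index (assuming the topic uses a `count` toggle), and the address are placeholders:

```hcl
# Hypothetical email subscription to the alarm topic. Email endpoints must
# confirm the subscription via the link SNS sends before they receive alarms.
resource "aws_sns_topic_subscription" "oncall_email" {
  topic_arn = aws_sns_topic.cloudwatch_alarms[0].arn
  protocol  = "email"
  endpoint  = "oncall@example.com"
}
```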
## Alarm Coverage

### Lambda

- Invocation errors (`AWS/Lambda` `Errors`)
- Throttles (`AWS/Lambda` `Throttles`)
- Duration p95 near timeout (`AWS/Lambda` `Duration`)
- Log-based error detection via CloudWatch Logs metric filters (ERROR-level JSON fields)
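The log-based detection above relies on a metric filter matching ERROR-level fields in JSON log lines. A minimal sketch — the log group name, metric namespace, and JSON field name are assumptions, not GitGazer's actual values:

```hcl
# Illustrative filter: emits 1 to a custom metric whenever a JSON log line
# has level == "ERROR". An alarm can then fire on that metric.
resource "aws_cloudwatch_log_metric_filter" "lambda_errors" {
  name           = "lambda-error-logs"
  log_group_name = "/aws/lambda/my-function"
  pattern        = "{ $.level = \"ERROR\" }"

  metric_transformation {
    name          = "LambdaLogErrors"
    namespace     = "GitGazer/Lambda"
    value         = "1"
    default_value = "0"
  }
}
```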
### API Gateway
- HTTP API 5xx
- HTTP API p95 latency
- WebSocket API 5xx
### SQS
- Main queue visible backlog
- Oldest message age
- DLQ depth
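A DLQ depth alarm is typically a "greater than zero" check on `ApproximateNumberOfMessagesVisible`. A hedged sketch — the alarm name, queue name, period, and `[0]` topic index are illustrative:

```hcl
# Illustrative DLQ alarm: any visible message in the dead-letter queue fires it.
# As a critical alarm it also notifies on recovery via ok_actions.
resource "aws_cloudwatch_metric_alarm" "dlq_depth" {
  alarm_name          = "sqs-dlq-not-empty"
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  dimensions          = { QueueName = "my-queue-dlq" }
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 1
  comparison_operator = "GreaterThanThreshold"
  threshold           = 0
  alarm_actions       = [aws_sns_topic.cloudwatch_alarms[0].arn]
  ok_actions          = [aws_sns_topic.cloudwatch_alarms[0].arn]
}
```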
### RDS / Aurora
- Database connections high
- ACU utilization high (serverless classes)
- Deadlocks
- Free local storage low (per Aurora instance)
### CloudFront
- UI distribution 5xx error rate
- Docs distribution 5xx error rate (only when docs are enabled)
### EventBridge
- Org sync schedule failed invocations
## Noise Controls

Current defaults are tuned to reduce alert fatigue:

- `treat_missing_data = "notBreaching"`
- `insufficient_data_actions = []`
- `ok_actions` only on critical alarms (for example Lambda invocation errors, DLQ depth, EventBridge failed invocations)
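On an individual non-critical alarm, these defaults look roughly like the following (a sketch with the metric configuration elided; not GitGazer's actual resource):

```hcl
resource "aws_cloudwatch_metric_alarm" "noncritical_example" {
  # ... metric, threshold, and evaluation configuration elided ...

  treat_missing_data        = "notBreaching" # gaps in metric data do not trip the alarm
  insufficient_data_actions = []             # no notifications while data is missing
  # ok_actions is omitted here: recovery notifications go out only for critical alarms.
}
```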
## Cost Notes
Main incremental cost drivers are:
- CloudWatch custom metrics generated by log metric filters
- CloudWatch alarms
Log metric filters add little operational overhead, but the custom metric count scales with the number of monitored Lambda log groups.