
Infrastructure Monitoring & Alerting

GitGazer includes Terraform-managed infrastructure monitoring built on Amazon CloudWatch and SNS.

Overview

Monitoring resources are defined in infra/monitoring_alarms.tf:

  • CloudWatch metric alarms for API, Lambda, SQS, RDS/Aurora, CloudFront, and EventBridge
  • CloudWatch Logs metric filters for Lambda ERROR-level logs
  • SNS topic for alert notifications

Global Toggle

All monitoring resources are controlled by one Terraform variable:

  • enable_cloudwatch_alarm_notifications
  • Default: true

When this variable is false, Terraform creates none of the following:

  • the SNS alarm topic
  • the CloudWatch metric alarms
  • the CloudWatch Logs metric filters used for Lambda error detection
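
A minimal sketch of how such a toggle is typically wired, assuming the common count-gating pattern (the variable name is from this doc; the topic name and the gating pattern itself are illustrative, not necessarily GitGazer's exact implementation):

variable "enable_cloudwatch_alarm_notifications" {
  description = "Create the SNS alert topic, CloudWatch alarms, and log metric filters."
  type        = bool
  default     = true
}

# Each monitoring resource is gated on the toggle via count (assumed pattern).
resource "aws_sns_topic" "cloudwatch_alarms" {
  count = var.enable_cloudwatch_alarm_notifications ? 1 : 0
  name  = "gitgazer-cloudwatch-alarms" # topic name assumed
}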

Set it in infra/terraform.tfvars:

enable_cloudwatch_alarm_notifications = false

Apply the change:

cd infra
aws-vault exec <profile> -- terraform apply

Alarm Delivery

When monitoring is enabled, alarms publish to an SNS topic:

  • Resource: aws_sns_topic.cloudwatch_alarms
  • Output: cloudwatch_alarms_topic_arn
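
Assuming the count-gated topic sketched earlier, the output can use one() so it returns null when monitoring is disabled (a sketch, not necessarily the repo's exact code):

output "cloudwatch_alarms_topic_arn" {
  description = "ARN of the alarm SNS topic, or null when monitoring is disabled."
  value       = one(aws_sns_topic.cloudwatch_alarms[*].arn)
}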

Retrieve the topic ARN with:

cd infra
aws-vault exec <profile> -- terraform output cloudwatch_alarms_topic_arn

The topic has no subscribers by default; add one or more SNS subscriptions (email, webhook bridge, incident tooling) to receive notifications.
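
For example, a hypothetical email subscription managed in the same stack (resource name and address are placeholders; SNS email endpoints must confirm the subscription before they receive anything):

resource "aws_sns_topic_subscription" "oncall_email" {
  count     = var.enable_cloudwatch_alarm_notifications ? 1 : 0
  topic_arn = aws_sns_topic.cloudwatch_alarms[0].arn
  protocol  = "email"
  endpoint  = "oncall@example.com" # placeholder address
}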

Alarm Coverage

Lambda

  • Invocation errors (AWS/Lambda Errors)
  • Throttles (AWS/Lambda Throttles)
  • Duration p95 near timeout (AWS/Lambda Duration)
  • Log-based error detection via CloudWatch Logs metric filters matching ERROR-level fields in structured JSON logs (sketched below)
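
The log-based detection presumably pairs a metric filter with an alarm on the resulting custom metric. A sketch of the filter half, with an illustrative log group, metric name, and namespace (count gating per the global toggle is omitted for brevity):

resource "aws_cloudwatch_log_metric_filter" "lambda_error_logs" {
  name           = "lambda-error-logs"        # filter name assumed
  log_group_name = "/aws/lambda/gitgazer-api" # one filter per monitored function
  pattern        = "{ $.level = \"ERROR\" }"  # matches ERROR-level structured JSON logs

  metric_transformation {
    name      = "ErrorLogCount"   # custom metric name assumed
    namespace = "GitGazer/Lambda" # custom namespace assumed
    value     = "1"               # emit 1 per matching log event
  }
}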

API Gateway

  • HTTP API 5xx
  • HTTP API p95 latency
  • WebSocket API 5xx
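
The p95 latency alarm needs extended_statistic rather than a plain statistic. A sketch against the HTTP API metrics in AWS/ApiGateway (ApiId, threshold, and alarm name are placeholders):

resource "aws_cloudwatch_metric_alarm" "http_api_latency_p95" {
  alarm_name          = "gitgazer-http-api-latency-p95" # name assumed
  namespace           = "AWS/ApiGateway"
  metric_name         = "Latency"
  dimensions          = { ApiId = "a1b2c3d4" } # placeholder HTTP API id
  extended_statistic  = "p95"
  period              = 300
  evaluation_periods  = 3
  threshold           = 2000 # milliseconds (placeholder)
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
  # alarm_actions omitted here; the real config publishes to the alert topic
}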

SQS

  • Main queue visible backlog
  • Oldest message age
  • DLQ depth
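
Backlog-age alarms read ApproximateAgeOfOldestMessage from AWS/SQS with the Maximum statistic. A sketch (queue name and threshold are placeholders):

resource "aws_cloudwatch_metric_alarm" "queue_oldest_message_age" {
  alarm_name          = "gitgazer-queue-oldest-message-age" # name assumed
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateAgeOfOldestMessage"
  dimensions          = { QueueName = "gitgazer-main" } # placeholder
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 2
  threshold           = 900 # seconds (placeholder)
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
}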

RDS / Aurora

  • Database connections high
  • ACU utilization high (serverless classes)
  • Deadlocks
  • Free local storage low (per Aurora instance)
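
For Aurora Serverless v2, ACU utilization surfaces as the ACUUtilization metric in AWS/RDS, a percentage reported per instance. A sketch (instance identifier and threshold are placeholders):

resource "aws_cloudwatch_metric_alarm" "aurora_acu_utilization" {
  alarm_name          = "gitgazer-aurora-acu-utilization" # name assumed
  namespace           = "AWS/RDS"
  metric_name         = "ACUUtilization"
  dimensions          = { DBInstanceIdentifier = "gitgazer-aurora-1" } # placeholder
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 3
  threshold           = 90 # percent (placeholder)
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
}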

CloudFront

  • UI distribution 5xx error rate
  • Docs distribution 5xx error rate (only when docs are enabled)
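
CloudFront publishes 5xxErrorRate (a percentage) only to us-east-1, with a fixed Region = "Global" dimension, so these alarms need a us-east-1 provider. A sketch (provider alias, distribution id, and threshold are placeholders):

resource "aws_cloudwatch_metric_alarm" "ui_5xx_error_rate" {
  provider    = aws.us_east_1 # CloudFront metrics live in us-east-1
  alarm_name  = "gitgazer-ui-5xx-error-rate" # name assumed
  namespace   = "AWS/CloudFront"
  metric_name = "5xxErrorRate"
  dimensions = {
    DistributionId = "E1ABCDEF" # placeholder
    Region         = "Global"
  }
  statistic           = "Average"
  period              = 300
  evaluation_periods  = 2
  threshold           = 5 # percent (placeholder)
  comparison_operator = "GreaterThanThreshold"
  treat_missing_data  = "notBreaching"
}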

EventBridge

  • Org sync schedule failed invocations
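
Failed scheduled invocations appear as FailedInvocations in AWS/Events, dimensioned by rule name. A sketch (rule name is a placeholder):

resource "aws_cloudwatch_metric_alarm" "org_sync_failed_invocations" {
  alarm_name          = "gitgazer-org-sync-failed-invocations" # name assumed
  namespace           = "AWS/Events"
  metric_name         = "FailedInvocations"
  dimensions          = { RuleName = "gitgazer-org-sync" } # placeholder
  statistic           = "Sum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"
  treat_missing_data  = "notBreaching"
}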

Noise Controls

Current defaults are tuned to reduce alert fatigue:

  • treat_missing_data = "notBreaching"
  • insufficient_data_actions = []
  • ok_actions only on critical alarms (for example Lambda invocation errors, DLQ depth, EventBridge failed invocations)
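
Applied to a critical alarm such as DLQ depth, those defaults look roughly like this (queue name is a placeholder; the count-gated topic reference follows the earlier sketch):

resource "aws_cloudwatch_metric_alarm" "dlq_depth" {
  alarm_name          = "gitgazer-dlq-depth" # name assumed
  namespace           = "AWS/SQS"
  metric_name         = "ApproximateNumberOfMessagesVisible"
  dimensions          = { QueueName = "gitgazer-dlq" } # placeholder
  statistic           = "Maximum"
  period              = 300
  evaluation_periods  = 1
  threshold           = 1
  comparison_operator = "GreaterThanOrEqualToThreshold"

  # Noise controls from the list above:
  treat_missing_data        = "notBreaching"
  insufficient_data_actions = []
  alarm_actions             = [aws_sns_topic.cloudwatch_alarms[0].arn]
  ok_actions                = [aws_sns_topic.cloudwatch_alarms[0].arn] # critical alarms only
}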

Cost Notes

The main incremental cost drivers are:

  • CloudWatch custom metrics generated by log metric filters
  • CloudWatch alarms

Log metric filters themselves add little operational overhead, but the custom-metric count scales with the number of monitored Lambda log groups.
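
As a rough illustration only (verify against current CloudWatch pricing): at on-demand rates of about $0.30 per custom metric and $0.10 per standard-resolution alarm per month, 10 monitored Lambda log groups emitting one error metric each plus 25 alarms would add roughly $3.00 + $2.50 = $5.50 per month.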