sendico/api/ledger/METRICS.md
Stephan D 62a6631b9a
2025-11-07 18:35:26 +01:00

Ledger Service - Prometheus Metrics

Overview

The Ledger service exposes Prometheus metrics on the metrics endpoint (default: :9401/metrics). This provides operational visibility into ledger operations, performance, and errors.

Metrics Endpoint

  • URL: http://localhost:9401/metrics
  • Format: Prometheus exposition format
  • Configuration: Set via metrics.address in config.yml
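
To make the exposition format concrete, here is a minimal Python sketch that parses sample lines into (name, labels, value) tuples. It is an illustration only, not part of the Ledger service: the helper names parse_exposition and scrape are hypothetical, escaped characters inside label values are not unescaped, and scrape() assumes a service is listening on the default address.

```python
import re
from urllib.request import urlopen

# Matches one sample line such as: metric_name{k="v",k2="v2"} 12.5
SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_exposition(text):
    """Parse Prometheus text exposition into (name, labels, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        match = SAMPLE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def scrape(url="http://localhost:9401/metrics"):
    """Fetch and parse the endpoint (needs a running service at this URL)."""
    with urlopen(url, timeout=5) as response:
        return parse_exposition(response.read().decode("utf-8"))
```

Calling scrape() against a running instance returns every exposed sample; parse_exposition() can also be used on a saved copy of the endpoint output.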

Available Metrics

1. Journal Entry Operations

ledger_journal_entries_total

Type: Counter
Description: Total number of journal entries posted to the ledger
Labels:

  • entry_type: Type of journal entry (credit, debit, transfer, fx, fee, adjust, reverse)
  • status: Operation status (success, error, attempted)

Example:

# Count of successful credit entries
ledger_journal_entries_total{entry_type="credit", status="success"}

# Rate of failed transfers
rate(ledger_journal_entries_total{entry_type="transfer", status="error"}[5m])
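
The rate() call above turns an ever-increasing counter into a per-second value. The following Python sketch shows the underlying idea on raw (timestamp, value) samples, including the handling of counter resets after a restart. It is a simplification: Prometheus additionally extrapolates to the window boundaries, which this sketch does not attempt, and the function names are hypothetical.

```python
def counter_increase(samples):
    """Total increase across (timestamp, value) samples.

    A drop in value is treated as a counter reset (for example a service
    restart), so the new value is counted as growth from zero.
    """
    total = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def simple_rate(samples):
    """Per-second increase between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    elapsed = samples[-1][0] - samples[0][0]
    return counter_increase(samples) / elapsed if elapsed > 0 else 0.0
```

For example, samples taken 15 seconds apart with values 100, 130 and 160 yield an increase of 60 over 30 seconds, that is 2 entries per second.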

ledger_journal_entry_duration_seconds

Type: Histogram
Description: Duration of journal entry posting operations
Labels:

  • entry_type: Type of journal entry

Buckets: [.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10] seconds

Example:

# 95th percentile latency for credit postings
histogram_quantile(0.95, rate(ledger_journal_entry_duration_seconds_bucket{entry_type="credit"}[5m]))

# Average duration for all entry types
rate(ledger_journal_entry_duration_seconds_sum[5m]) / rate(ledger_journal_entry_duration_seconds_count[5m])
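
Because a histogram only stores cumulative counts per bucket, histogram_quantile estimates a percentile by linear interpolation inside the bucket that contains the target rank. The Python sketch below illustrates that estimate for a single series. It is an approximation of the behaviour for classic histograms, written for explanation only; the bucket counts used in the usage note are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative buckets.

    buckets is a list of (upper_bound, cumulative_count) pairs sorted by
    upper bound, with the final pair being the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank and count > prev_count:
            if math.isinf(bound):
                # The rank falls in the +Inf bucket: report the highest finite bound.
                return prev_bound
            # Assume observations are spread evenly within the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

With buckets [(0.1, 50), (0.25, 90), (0.5, 100), (math.inf, 100)], the 95th percentile has rank 95, which lies halfway through the 0.25 to 0.5 bucket, so the estimate is 0.375 seconds. The result can never be more precise than the bucket boundaries allow.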

ledger_journal_entry_errors_total

Type: Counter
Description: Total number of journal entry posting errors
Labels:

  • entry_type: Type of journal entry
  • error_type: Error classification (validation, insufficient_funds, db_error, not_implemented, etc.)

Example:

# Errors by type
sum by (error_type) (ledger_journal_entry_errors_total)

# Validation error rate for transfers
rate(ledger_journal_entry_errors_total{entry_type="transfer", error_type="validation"}[5m])

2. Balance Operations

ledger_balance_queries_total

Type: Counter
Description: Total number of balance queries
Labels:

  • status: Query status (success, error)

Example:

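# Balance query success ratio (both sides are aggregated so that their label sets match;
# without sum(), the success series is divided by itself and the result is always 1)
sum(rate(ledger_balance_queries_total{status="success"}[5m])) / sum(rate(ledger_balance_queries_total[5m]))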

ledger_balance_query_duration_seconds

Type: Histogram
Description: Duration of balance query operations
Labels:

  • status: Query status

Example:

# 99th percentile balance query latency
histogram_quantile(0.99, rate(ledger_balance_query_duration_seconds_bucket[5m]))

3. Reversal Operations

ledger_reversals_total

Type: Counter
Description: Total number of journal entry reversals
Labels:

  • status: Reversal status (success, error)

Example:

# Reversal error rate
rate(ledger_reversals_total{status="error"}[5m])

4. Transaction Amounts

ledger_transaction_amount

Type: Histogram
Description: Distribution of transaction amounts (normalized)
Labels:

  • currency: Currency code (USD, EUR, GBP, etc.)
  • entry_type: Type of journal entry

Buckets: [1, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000]

Example:

# Average transaction amount in USD for credits
rate(ledger_transaction_amount_sum{currency="USD", entry_type="credit"}[5m]) /
rate(ledger_transaction_amount_count{currency="USD", entry_type="credit"}[5m])

# 90th percentile transaction amount
histogram_quantile(0.90, rate(ledger_transaction_amount_bucket[5m]))

5. Account Operations

ledger_account_operations_total

Type: Counter
Description: Total number of account-level operations
Labels:

  • operation: Operation type (create, freeze, unfreeze)
  • status: Operation status (success, error)

Example:

# Account creation rate
rate(ledger_account_operations_total{operation="create"}[5m])

6. Idempotency

ledger_duplicate_requests_total

Type: Counter
Description: Total number of duplicate requests detected via idempotency keys
Labels:

  • entry_type: Type of journal entry

Example:

# Duplicate request rate (indicates retry behavior)
rate(ledger_duplicate_requests_total[5m])

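# Percentage of duplicate requests per entry type (both sides are aggregated by entry_type,
# because the two metrics carry different label sets and would otherwise not match)
sum by (entry_type) (rate(ledger_duplicate_requests_total[5m])) / sum by (entry_type) (rate(ledger_journal_entries_total[5m])) * 100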
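
As a rough illustration of what this counter measures, the Python sketch below counts repeated idempotency keys per entry type. It is a hypothetical model, not the service's implementation: the class name DuplicateTracker is invented here, keys are held in memory only, and a real ledger would presumably check keys against durable storage.

```python
from collections import Counter


class DuplicateTracker:
    """In-memory stand-in for idempotency-key duplicate detection."""

    def __init__(self):
        self._seen = set()
        # Plays the role of ledger_duplicate_requests_total, keyed by entry_type.
        self.duplicates = Counter()

    def observe(self, idempotency_key, entry_type):
        """Return True if the key was seen before, counting it as a duplicate."""
        if idempotency_key in self._seen:
            self.duplicates[entry_type] += 1
            return True
        self._seen.add(idempotency_key)
        return False
```

A client that retries a credit with the same key would raise duplicates["credit"] by one per retry, which is why a rising rate of this metric points at retry behavior rather than new traffic.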

7. gRPC Metrics (Built-in)

These are automatically provided by the gRPC framework:

grpc_server_requests_total

Type: Counter
Labels: grpc_service, grpc_method, grpc_type, grpc_code

grpc_server_handling_seconds

Type: Histogram
Labels: grpc_service, grpc_method, grpc_type, grpc_code

Example:

# gRPC error rate by method
sum by (grpc_method) (rate(grpc_server_requests_total{grpc_code!="OK"}[5m]))

# P95 latency for the PostCreditWithCharges RPC
histogram_quantile(0.95, sum by (le) (rate(grpc_server_handling_seconds_bucket{grpc_method="PostCreditWithCharges"}[5m])))

Common Queries

Health & Availability

# Overall request rate
sum(rate(grpc_server_requests_total[5m]))

# Journal entry error rate (errors per second, all entry and error types)
sum(rate(ledger_journal_entry_errors_total[5m]))

# Success rate for journal entries
sum(rate(ledger_journal_entries_total{status="success"}[5m])) / sum(rate(ledger_journal_entries_total[5m]))

Performance

# P99 latency for all journal entry types
histogram_quantile(0.99, sum(rate(ledger_journal_entry_duration_seconds_bucket[5m])) by (le, entry_type))

# Slowest operation types
topk(5, avg by (entry_type) (rate(ledger_journal_entry_duration_seconds_sum[5m]) / rate(ledger_journal_entry_duration_seconds_count[5m])))

Business Insights

# Transaction volume by type
sum by (entry_type) (rate(ledger_journal_entries_total{status="success"}[1h]))

# Money flow per currency (amount posted per second; amounts in different currencies are kept separate)
sum by (currency) (rate(ledger_transaction_amount_sum[5m]))

# Most common error types
topk(10, sum by (error_type) (rate(ledger_journal_entry_errors_total[5m])))

Grafana Dashboard

  1. Request Rate - sum(rate(grpc_server_requests_total[5m]))
  2. Error Rate - sum(rate(grpc_server_requests_total{grpc_code!="OK"}[5m]))
  3. P95/P99 Latency - histogram_quantile over ledger_journal_entry_duration_seconds_bucket
  4. Operations by Type - Stacked graph of ledger_journal_entries_total
  5. Error Breakdown - Pie chart of ledger_journal_entry_errors_total by error_type
  6. Transaction Volume - Rate of ledger_journal_entries_total{status="success"}
  7. Duplicate Requests - ledger_duplicate_requests_total rate

Alerting Rules

Critical

# High error rate (more than 10 journal entry errors per second in total)
- alert: LedgerHighErrorRate
  expr: sum(rate(ledger_journal_entry_errors_total[5m])) > 10
  for: 5m
  labels:
    severity: critical

# Service unavailable
- alert: LedgerServiceDown
  expr: up{job="ledger"} == 0
  for: 1m
  labels:
    severity: critical

Warning

# Slow operations
- alert: LedgerSlowOperations
  expr: histogram_quantile(0.95, rate(ledger_journal_entry_duration_seconds_bucket[5m])) > 1
  for: 10m
  labels:
    severity: warning

# High duplicate request rate (potential retry storm)
# Both sides are aggregated because the two metrics have different label sets.
- alert: LedgerHighDuplicateRate
  expr: sum(rate(ledger_duplicate_requests_total[5m])) / sum(rate(ledger_journal_entries_total[5m])) > 0.2
  for: 5m
  labels:
    severity: warning
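
The for: field in each rule means the expression must stay true for that long before the alert fires; until then the alert is only pending. The Python sketch below models this for one series. It is a simplified illustration with a hypothetical function name, and it assumes a plain "greater than threshold" condition like the rules above.

```python
def alert_states(evaluations, threshold, for_seconds):
    """Return the alert state after each rule evaluation.

    evaluations is a list of (timestamp_seconds, value) pairs in time order.
    Each state is "inactive", "pending" or "firing".
    """
    states = []
    active_since = None
    for timestamp, value in evaluations:
        if value > threshold:
            if active_since is None:
                active_since = timestamp  # condition just became true
            held = timestamp - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states
```

With a threshold of 10 and for_seconds of 300 (the 5m used by LedgerHighErrorRate), a value that stays above 10 from t=0 to t=300 is pending until t=300 and firing from then on; a single evaluation below the threshold returns it to inactive.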

Configuration

Metrics are configured in config.yml:

metrics:
  address: ":9401"  # Metrics HTTP server address
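
The address value is a host:port listen address, and an empty host such as ":9401" means all interfaces. As a small illustration, the Python sketch below derives the URL to scrape from such a value. The function name metrics_url and the choice of localhost as the fallback host are assumptions made for this example.

```python
def metrics_url(address, default_host="localhost"):
    """Build the metrics URL for a listen address such as ":9401"."""
    host, _, port = address.rpartition(":")
    # An empty host means the server listens on all interfaces.
    return f"http://{host or default_host}:{port}/metrics"
```

For the default configuration, metrics_url(":9401") gives http://localhost:9401/metrics, which matches the URL listed under Metrics Endpoint.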

Dependencies

  • Prometheus client library: github.com/prometheus/client_golang
  • All metrics are registered globally and exposed via the /metrics endpoint