Ledger Service - Prometheus Metrics
Overview
The Ledger service exposes Prometheus metrics on an HTTP endpoint (default: :9401/metrics), giving operational visibility into ledger operations, performance, and errors.
Metrics Endpoint
- URL: http://localhost:9401/metrics
- Format: Prometheus exposition format
- Configuration: Set via config.yml → metrics.address
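To make the endpoint's output format concrete, here is a minimal sketch that parses one sample line of the Prometheus text exposition format into a name, labels, and a value. The `Sample` type and `parseSample` function are hypothetical helpers, not part of the Ledger service, and the parser is deliberately simplified: it assumes label values contain no commas, braces, or escaped quotes, and it does not handle trailing timestamps.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Sample is one parsed line of the text exposition format.
type Sample struct {
	Name   string
	Labels map[string]string
	Value  float64
}

// parseSample handles lines such as `name{a="x",b="y"} 12` or `name 12`.
// Simplification: label values must not contain commas, braces, or escaped
// quotes, and a trailing timestamp is not supported.
func parseSample(line string) (Sample, error) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return Sample{}, fmt.Errorf("not a sample line: %q", line)
	}
	s := Sample{Labels: map[string]string{}}
	var rest string
	if open := strings.Index(line, "{"); open >= 0 {
		end := strings.LastIndex(line, "}")
		if end < open {
			return Sample{}, fmt.Errorf("unbalanced braces in %q", line)
		}
		s.Name = line[:open]
		for _, pair := range strings.Split(line[open+1:end], ",") {
			key, val, ok := strings.Cut(pair, "=")
			if !ok {
				continue // tolerate an empty segment such as a trailing comma
			}
			s.Labels[strings.TrimSpace(key)] = strings.Trim(strings.TrimSpace(val), `"`)
		}
		rest = line[end+1:]
	} else {
		name, value, ok := strings.Cut(line, " ")
		if !ok {
			return Sample{}, fmt.Errorf("missing value in %q", line)
		}
		s.Name, rest = name, value
	}
	v, err := strconv.ParseFloat(strings.TrimSpace(rest), 64)
	if err != nil {
		return Sample{}, fmt.Errorf("bad value in %q: %w", line, err)
	}
	s.Value = v
	return s, nil
}

func main() {
	// Hypothetical line as it might appear in the /metrics response.
	s, err := parseSample(`ledger_journal_entries_total{entry_type="credit",status="success"} 42`)
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	fmt.Println(s.Name, s.Labels["entry_type"], s.Labels["status"], s.Value)
}
```

For anything beyond a quick check, a full exposition-format parser is the safer choice, since real output can include escaped label values and timestamps.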
Available Metrics
1. Journal Entry Operations
ledger_journal_entries_total
Type: Counter
Description: Total number of journal entries posted to the ledger
Labels:
- entry_type: Type of journal entry (credit, debit, transfer, fx, fee, adjust, reverse)
- status: Operation status (success, error, attempted)
Example:
# Count of successful credit entries
ledger_journal_entries_total{entry_type="credit", status="success"}
# Rate of failed transfers
rate(ledger_journal_entries_total{entry_type="transfer", status="error"}[5m])
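The rate() calls above turn an ever-increasing counter into a per-second value. The following is a simplified sketch of that idea, not Prometheus's actual implementation: it sums the increases between consecutive scrapes, treats any drop as a counter reset, and divides by the elapsed time. The real rate() additionally extrapolates to the window boundaries. The `point` type, `simpleRate` function, and sample values are illustrative assumptions.

```go
package main

import "fmt"

// point is one scraped counter sample: time in seconds and counter value.
type point struct {
	T float64
	V float64
}

// simpleRate returns the average per-second increase across the samples.
// A drop in value is treated as a counter reset (for example a process
// restart), so the post-reset value counts in full as new increase.
// Unlike Prometheus's rate(), this does not extrapolate to window edges.
func simpleRate(samples []point) float64 {
	if len(samples) < 2 {
		return 0
	}
	increase := 0.0
	for i := 1; i < len(samples); i++ {
		delta := samples[i].V - samples[i-1].V
		if delta < 0 {
			delta = samples[i].V // counter restarted from zero
		}
		increase += delta
	}
	span := samples[len(samples)-1].T - samples[0].T
	if span <= 0 {
		return 0
	}
	return increase / span
}

func main() {
	// Hypothetical scrapes every 60s of
	// ledger_journal_entries_total{entry_type="transfer",status="error"},
	// with a service restart between the third and fourth scrape.
	samples := []point{{0, 100}, {60, 106}, {120, 112}, {180, 3}, {240, 9}}
	fmt.Printf("%.4f errors/s\n", simpleRate(samples))
}
```

Because resets are absorbed this way, counters such as ledger_journal_entries_total should be queried through rate() or increase() rather than compared as raw values.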
ledger_journal_entry_duration_seconds
Type: Histogram
Description: Duration of journal entry posting operations
Labels:
- entry_type: Type of journal entry
Buckets: [.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10] seconds
Example:
# 95th percentile latency for credit postings
histogram_quantile(0.95, rate(ledger_journal_entry_duration_seconds_bucket{entry_type="credit"}[5m]))
# Average duration for all entry types
rate(ledger_journal_entry_duration_seconds_sum[5m]) / rate(ledger_journal_entry_duration_seconds_count[5m])
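Percentiles from histogram_quantile are estimates: the function only sees cumulative bucket counts and interpolates linearly inside the bucket that contains the target rank. The sketch below illustrates that calculation under simplifying assumptions (buckets already sorted, +Inf bucket last). The `bucket` type, `quantile` function, and the counts in `exampleBuckets` are made up for illustration; only the bucket bounds come from the list above.

```go
package main

import (
	"fmt"
	"math"
)

// bucket mirrors one *_bucket series of a Prometheus histogram.
type bucket struct {
	UpperBound float64 // value of the "le" label
	Count      float64 // cumulative count of observations <= UpperBound
}

// quantile estimates the q-quantile from cumulative buckets sorted by upper
// bound, with the +Inf bucket last. It assumes observations are spread
// uniformly within each bucket (linear interpolation).
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) < 2 || q < 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].Count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, b := range buckets {
		if b.Count < rank {
			continue
		}
		if math.IsInf(b.UpperBound, 1) {
			// Rank falls in the +Inf bucket: report the highest finite bound.
			return buckets[len(buckets)-2].UpperBound
		}
		lower, prev := 0.0, 0.0
		if i > 0 {
			lower, prev = buckets[i-1].UpperBound, buckets[i-1].Count
		}
		if b.Count == prev {
			return lower
		}
		return lower + (b.UpperBound-lower)*(rank-prev)/(b.Count-prev)
	}
	return math.NaN()
}

// exampleBuckets returns hypothetical cumulative counts for a few of the
// ledger_journal_entry_duration_seconds bucket bounds.
func exampleBuckets() []bucket {
	return []bucket{{0.05, 60}, {0.1, 90}, {0.25, 98}, {math.Inf(1), 100}}
}

func main() {
	fmt.Printf("estimated p95: %.5fs\n", quantile(0.95, exampleBuckets()))
}
```

One consequence of this design is that the estimate can never be more precise than the bucket width around the target rank, so latency thresholds that matter for alerting are best placed on or near a bucket boundary.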
ledger_journal_entry_errors_total
Type: Counter
Description: Total number of journal entry posting errors
Labels:
- entry_type: Type of journal entry
- error_type: Error classification (validation, insufficient_funds, db_error, not_implemented, etc.)
Example:
# Errors by type
sum by (error_type) (ledger_journal_entry_errors_total)
# Validation error rate for transfers
rate(ledger_journal_entry_errors_total{entry_type="transfer", error_type="validation"}[5m])
2. Balance Operations
ledger_balance_queries_total
Type: Counter
Description: Total number of balance queries
Labels:
- status: Query status (success, error)
Example:
# Balance query success ratio (aggregate both sides so the label sets match)
sum(rate(ledger_balance_queries_total{status="success"}[5m])) / sum(rate(ledger_balance_queries_total[5m]))
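Division between two instant vectors pairs up only series whose label sets are identical, which is why ratio queries are normally aggregated with sum() on both sides. The sketch below models that matching rule with plain maps; the `vector` type and the `divide` and `sumAll` helpers are illustrative stand-ins for PromQL behaviour, and the rates are invented.

```go
package main

import "fmt"

// vector models an instant vector: canonical label-set string -> value.
type vector map[string]float64

// divide mimics one-to-one vector matching: a result is produced only for
// label sets that appear on both sides.
func divide(lhs, rhs vector) vector {
	out := vector{}
	for labels, l := range lhs {
		if r, ok := rhs[labels]; ok {
			out[labels] = l / r
		}
	}
	return out
}

// sumAll collapses a vector into a single series with an empty label set,
// like sum() without a "by" clause.
func sumAll(v vector) vector {
	total := 0.0
	for _, x := range v {
		total += x
	}
	return vector{"": total}
}

func main() {
	// Hypothetical per-second rates of ledger_balance_queries_total.
	all := vector{`status="success"`: 9.5, `status="error"`: 0.5}
	success := vector{`status="success"`: 9.5}

	// Without aggregation the success series is divided by itself.
	fmt.Println("unaggregated:", divide(success, all))
	// With aggregation both sides share the empty label set.
	fmt.Println("aggregated:  ", divide(sumAll(success), sumAll(all)))
}
```

In the unaggregated case the only matching pair is the success series on both sides, so the result is always 1 regardless of how many queries fail.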
ledger_balance_query_duration_seconds
Type: Histogram
Description: Duration of balance query operations
Labels:
- status: Query status
Example:
# 99th percentile balance query latency
histogram_quantile(0.99, rate(ledger_balance_query_duration_seconds_bucket[5m]))
3. Reversal Operations
ledger_reversals_total
Type: Counter
Description: Total number of journal entry reversals
Labels:
- status: Reversal status (success, error)
Example:
# Reversal error rate
rate(ledger_reversals_total{status="error"}[5m])
4. Transaction Amounts
ledger_transaction_amount
Type: Histogram
Description: Distribution of transaction amounts (normalized)
Labels:
- currency: Currency code (USD, EUR, GBP, etc.)
- entry_type: Type of journal entry
Buckets: [1, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000]
Example:
# Average transaction amount in USD for credits
rate(ledger_transaction_amount_sum{currency="USD", entry_type="credit"}[5m]) /
rate(ledger_transaction_amount_count{currency="USD", entry_type="credit"}[5m])
# 90th percentile transaction amount
histogram_quantile(0.90, rate(ledger_transaction_amount_bucket[5m]))
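The _sum, _count, and _bucket series used above all derive from the same observations. As a rough model of the exposed data (not of client_golang's internals, which differ), the sketch below records a few hypothetical amounts into cumulative buckets using the bounds listed for ledger_transaction_amount. The `histogram` type and its methods are illustrative names.

```go
package main

import "fmt"

// histogram models the exposed series of a Prometheus histogram.
type histogram struct {
	bounds []float64 // finite upper bounds, ascending
	counts []uint64  // cumulative counts; the last entry is the +Inf bucket
	sum    float64   // running total of observed values (the _sum series)
}

func newHistogram(bounds []float64) *histogram {
	return &histogram{bounds: bounds, counts: make([]uint64, len(bounds)+1)}
}

// observe increments every bucket whose upper bound is >= v, because
// exposed bucket counts are cumulative.
func (h *histogram) observe(v float64) {
	for i, b := range h.bounds {
		if v <= b {
			h.counts[i]++
		}
	}
	h.counts[len(h.bounds)]++ // +Inf counts every observation (equals _count)
	h.sum += v
}

func main() {
	// Bucket bounds taken from ledger_transaction_amount.
	h := newHistogram([]float64{1, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000})
	for _, amount := range []float64{7.5, 120, 120000} { // hypothetical amounts
		h.observe(amount)
	}
	for i, b := range h.bounds {
		fmt.Printf("le=%g count=%d\n", b, h.counts[i])
	}
	fmt.Printf("le=+Inf count=%d sum=%g\n", h.counts[len(h.bounds)], h.sum)
}
```

Note that an amount above the largest finite bound (100000) lands only in the +Inf bucket, so quantiles that fall there are reported as 100000 rather than the true value.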
5. Account Operations
ledger_account_operations_total
Type: Counter
Description: Total number of account-level operations
Labels:
- operation: Operation type (create, freeze, unfreeze)
- status: Operation status (success, error)
Example:
# Account creation rate
rate(ledger_account_operations_total{operation="create"}[5m])
6. Idempotency
ledger_duplicate_requests_total
Type: Counter
Description: Total number of duplicate requests detected via idempotency keys
Labels:
- entry_type: Type of journal entry
Example:
# Duplicate request rate (indicates retry behavior)
rate(ledger_duplicate_requests_total[5m])
# Percentage of duplicate requests (aggregated, since the two metrics carry different labels)
sum(rate(ledger_duplicate_requests_total[5m])) / sum(rate(ledger_journal_entries_total[5m])) * 100
7. gRPC Metrics (Built-in)
These are automatically provided by the gRPC framework:
grpc_server_requests_total
Type: Counter
Labels: grpc_service, grpc_method, grpc_type, grpc_code
grpc_server_handling_seconds
Type: Histogram
Labels: grpc_service, grpc_method, grpc_type, grpc_code
Example:
# gRPC error rate by method
sum by (grpc_method) (rate(grpc_server_requests_total{grpc_code!="OK"}[5m]))
# P95 latency for the PostCreditWithCharges RPC
histogram_quantile(0.95, rate(grpc_server_handling_seconds_bucket{grpc_method="PostCreditWithCharges"}[5m]))
Common Queries
Health & Availability
# Overall request rate
sum(rate(grpc_server_requests_total[5m]))
# Journal entry error rate (all entry types)
sum(rate(ledger_journal_entry_errors_total[5m]))
# Success rate for journal entries
sum(rate(ledger_journal_entries_total{status="success"}[5m])) / sum(rate(ledger_journal_entries_total[5m]))
Performance
# P99 latency for all journal entry types
histogram_quantile(0.99, sum(rate(ledger_journal_entry_duration_seconds_bucket[5m])) by (le, entry_type))
# Slowest operation types
topk(5, avg by (entry_type) (rate(ledger_journal_entry_duration_seconds_sum[5m]) / rate(ledger_journal_entry_duration_seconds_count[5m])))
Business Insights
# Transaction volume by type
sum by (entry_type) (rate(ledger_journal_entries_total{status="success"}[1h]))
# Total money flow per currency (sum of transaction amounts; amounts in different currencies are not summed together)
sum by (currency) (rate(ledger_transaction_amount_sum[5m]))
# Most common error types
topk(10, sum by (error_type) (rate(ledger_journal_entry_errors_total[5m])))
Grafana Dashboard
Recommended Panels
- Request Rate - sum(rate(grpc_server_requests_total[5m]))
- Error Rate - sum(rate(grpc_server_requests_total{grpc_code!="OK"}[5m]))
- P95/P99 Latency - Histogram quantiles
- Operations by Type - Stacked graph of ledger_journal_entries_total
- Error Breakdown - Pie chart of ledger_journal_entry_errors_total by error_type
- Transaction Volume - Counter of successful entries
- Duplicate Requests - ledger_duplicate_requests_total rate
Alerting Rules
Critical
# High error rate
- alert: LedgerHighErrorRate
  expr: rate(ledger_journal_entry_errors_total[5m]) > 10
  for: 5m
  labels:
    severity: critical
# Service unavailable
- alert: LedgerServiceDown
  expr: up{job="ledger"} == 0
  for: 1m
  labels:
    severity: critical
Warning
# Slow operations
- alert: LedgerSlowOperations
  expr: histogram_quantile(0.95, rate(ledger_journal_entry_duration_seconds_bucket[5m])) > 1
  for: 10m
  labels:
    severity: warning
# High duplicate request rate (potential retry storm)
- alert: LedgerHighDuplicateRate
  expr: sum(rate(ledger_duplicate_requests_total[5m])) / sum(rate(ledger_journal_entries_total[5m])) > 0.2
  for: 5m
  labels:
    severity: warning
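Each rule's for: clause delays firing until the expression has been true continuously for that long; until then the alert is pending, and a single false evaluation resets it. The sketch below models that state machine in simplified form. The `tracker` type, its `eval` method, and the 30-second evaluation interval are assumptions for illustration, not Prometheus code.

```go
package main

import "fmt"

// alertState is the lifecycle state of one alert instance.
type alertState int

const (
	inactive alertState = iota
	pending
	firing
)

func (s alertState) String() string {
	return [...]string{"inactive", "pending", "firing"}[s]
}

// tracker follows one alert across rule evaluations.
type tracker struct {
	hold        float64 // the rule's "for" duration, in seconds
	activeSince float64 // time at which the condition first became true
	state       alertState
}

// eval records one rule evaluation at time now (in seconds).
func (tr *tracker) eval(now float64, condition bool) alertState {
	if !condition {
		tr.state = inactive // any false evaluation resets the alert
		return tr.state
	}
	if tr.state == inactive {
		tr.state = pending
		tr.activeSince = now
	}
	if now-tr.activeSince >= tr.hold {
		tr.state = firing
	}
	return tr.state
}

func main() {
	// LedgerServiceDown uses "for: 1m"; evaluations here run every 30s
	// (an assumed interval).
	tr := &tracker{hold: 60}
	for i, down := range []bool{true, true, true, false, true} {
		now := float64(i * 30)
		fmt.Printf("t=%3.0fs down=%-5t state=%s\n", now, down, tr.eval(now, down))
	}
}
```

Longer for: values, as on the warning rules, trade slower notification for fewer alerts on short spikes.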
Configuration
Metrics are configured in config.yml:
metrics:
  address: ":9401" # Metrics HTTP server address
Dependencies
- Prometheus client library: github.com/prometheus/client_golang
- All metrics are registered globally and exposed via the /metrics endpoint