# Ledger Service - Prometheus Metrics

## Overview

The Ledger service exposes Prometheus metrics on a dedicated metrics endpoint (default: `:9401/metrics`). These metrics provide operational visibility into ledger operations, performance, and errors.

## Metrics Endpoint

- **URL**: `http://localhost:9401/metrics`
- **Format**: Prometheus exposition format
- **Configuration**: Set via `config.yml` → `metrics.address`

## Available Metrics

### 1. Journal Entry Operations

#### `ledger_journal_entries_total`

**Type**: Counter

**Description**: Total number of journal entries posted to the ledger

**Labels**:

- `entry_type`: Type of journal entry (`credit`, `debit`, `transfer`, `fx`, `fee`, `adjust`, `reverse`)
- `status`: Operation status (`success`, `error`, `attempted`)

Ratio queries that divide by the unfiltered total assume each entry increments exactly one `status` value. If `attempted` is recorded in addition to the final outcome, restrict the denominator to `status=~"success|error"`.

**Example**:

```promql
# Count of successful credit entries
ledger_journal_entries_total{entry_type="credit", status="success"}

# Rate of failed transfers
rate(ledger_journal_entries_total{entry_type="transfer", status="error"}[5m])
```

---

#### `ledger_journal_entry_duration_seconds`

**Type**: Histogram

**Description**: Duration of journal entry posting operations

**Labels**:

- `entry_type`: Type of journal entry

**Buckets**: `[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]` seconds

**Example**:

```promql
# 95th percentile latency for credit postings
histogram_quantile(0.95, rate(ledger_journal_entry_duration_seconds_bucket{entry_type="credit"}[5m]))

# Average duration per entry type
rate(ledger_journal_entry_duration_seconds_sum[5m])
  / rate(ledger_journal_entry_duration_seconds_count[5m])
```

---

#### `ledger_journal_entry_errors_total`

**Type**: Counter

**Description**: Total number of journal entry posting errors

**Labels**:

- `entry_type`: Type of journal entry
- `error_type`: Error classification (`validation`, `insufficient_funds`, `db_error`, `not_implemented`, etc.)

**Example**:

```promql
# Errors by type
sum by (error_type) (ledger_journal_entry_errors_total)

# Validation error rate for transfers
rate(ledger_journal_entry_errors_total{entry_type="transfer", error_type="validation"}[5m])
```

---

### 2. Balance Operations

#### `ledger_balance_queries_total`

**Type**: Counter

**Description**: Total number of balance queries

**Labels**:

- `status`: Query status (`success`, `error`)

**Example**:

```promql
# Balance query success rate
# Both sides are aggregated with sum(); dividing the raw vectors would only
# match the status="success" series against itself and always yield 1.
sum(rate(ledger_balance_queries_total{status="success"}[5m]))
  / sum(rate(ledger_balance_queries_total[5m]))
```

---

#### `ledger_balance_query_duration_seconds`

**Type**: Histogram

**Description**: Duration of balance query operations

**Labels**:

- `status`: Query status

**Example**:

```promql
# 99th percentile balance query latency across all statuses
histogram_quantile(0.99, sum by (le) (rate(ledger_balance_query_duration_seconds_bucket[5m])))
```

---

### 3. Reversal Operations

#### `ledger_reversals_total`

**Type**: Counter

**Description**: Total number of journal entry reversals

**Labels**:

- `status`: Reversal status (`success`, `error`)

**Example**:

```promql
# Rate of failed reversals (errors per second)
rate(ledger_reversals_total{status="error"}[5m])
```

---

### 4. Transaction Amounts

#### `ledger_transaction_amount`

**Type**: Histogram

**Description**: Distribution of transaction amounts (normalized)

**Labels**:

- `currency`: Currency code (`USD`, `EUR`, `GBP`, etc.)
- `entry_type`: Type of journal entry

**Buckets**: `[1, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000]`

**Example**:

```promql
# Average transaction amount in USD for credits
rate(ledger_transaction_amount_sum{currency="USD", entry_type="credit"}[5m])
  / rate(ledger_transaction_amount_count{currency="USD", entry_type="credit"}[5m])

# 90th percentile transaction amount, per currency and entry type
histogram_quantile(0.90, rate(ledger_transaction_amount_bucket[5m]))
```

---

### 5. Account Operations

#### `ledger_account_operations_total`

**Type**: Counter

**Description**: Total number of account-level operations

**Labels**:

- `operation`: Operation type (`create`, `freeze`, `unfreeze`)
- `status`: Operation status (`success`, `error`)

**Example**:

```promql
# Account creation rate
rate(ledger_account_operations_total{operation="create"}[5m])
```

---

### 6. Idempotency

#### `ledger_duplicate_requests_total`

**Type**: Counter

**Description**: Total number of duplicate requests detected via idempotency keys

**Labels**:

- `entry_type`: Type of journal entry

**Example**:

```promql
# Duplicate request rate (indicates retry behavior)
rate(ledger_duplicate_requests_total[5m])

# Percentage of duplicate requests
# The two metrics carry different label sets, so both sides are aggregated
# with sum() before dividing; otherwise no series would match.
sum(rate(ledger_duplicate_requests_total[5m]))
  / sum(rate(ledger_journal_entries_total[5m])) * 100
```

---

### 7. gRPC Metrics (Built-in)

These are automatically provided by the gRPC framework:

#### `grpc_server_requests_total`

**Type**: Counter

**Labels**: `grpc_service`, `grpc_method`, `grpc_type`, `grpc_code`

#### `grpc_server_handling_seconds`

**Type**: Histogram

**Labels**: `grpc_service`, `grpc_method`, `grpc_type`, `grpc_code`

**Example**:

```promql
# gRPC error rate by method
sum by (grpc_method) (rate(grpc_server_requests_total{grpc_code!="OK"}[5m]))

# P95 latency for the PostCreditWithCharges RPC
histogram_quantile(0.95, sum by (le) (rate(grpc_server_handling_seconds_bucket{grpc_method="PostCreditWithCharges"}[5m])))
```

---

## Common Queries

### Health & Availability

```promql
# Overall request rate
sum(rate(grpc_server_requests_total[5m]))

# Error rate (all operations)
sum(rate(ledger_journal_entry_errors_total[5m]))

# Success rate for journal entries
sum(rate(ledger_journal_entries_total{status="success"}[5m]))
  / sum(rate(ledger_journal_entries_total[5m]))
```

### Performance

```promql
# P99 latency per journal entry type
histogram_quantile(0.99, sum(rate(ledger_journal_entry_duration_seconds_bucket[5m])) by (le, entry_type))

# Slowest operation types (by average duration)
topk(5, avg by (entry_type) (
  rate(ledger_journal_entry_duration_seconds_sum[5m])
    / rate(ledger_journal_entry_duration_seconds_count[5m])
))
```

### Business Insights

```promql
# Transaction volume by type
sum by (entry_type) (rate(ledger_journal_entries_total{status="success"}[1h]))

# Money flow per second, per currency (amounts in different currencies are not summed together)
sum by (currency) (rate(ledger_transaction_amount_sum[5m]))

# Most common error types
topk(10, sum by (error_type) (rate(ledger_journal_entry_errors_total[5m])))
```

---

## Grafana Dashboard

### Recommended Panels

1. **Request Rate** - `sum(rate(grpc_server_requests_total[5m]))`
2. **Error Rate** - `sum(rate(grpc_server_requests_total{grpc_code!="OK"}[5m]))`
3. **P95/P99 Latency** - Histogram quantiles
4. **Operations by Type** - Stacked graph of `ledger_journal_entries_total`
5. **Error Breakdown** - Pie chart of `ledger_journal_entry_errors_total` by `error_type`
6. **Transaction Volume** - Counter of successful entries
7. **Duplicate Requests** - `ledger_duplicate_requests_total` rate

---

## Alerting Rules

### Critical

```yaml
# High error rate (more than 10 errors/s for any entry_type/error_type pair)
- alert: LedgerHighErrorRate
  expr: rate(ledger_journal_entry_errors_total[5m]) > 10
  for: 5m
  labels:
    severity: critical

# Service unavailable
- alert: LedgerServiceDown
  expr: up{job="ledger"} == 0
  for: 1m
  labels:
    severity: critical
```

### Warning

```yaml
# Slow operations (P95 above 1 second for any entry type)
- alert: LedgerSlowOperations
  expr: histogram_quantile(0.95, rate(ledger_journal_entry_duration_seconds_bucket[5m])) > 1
  for: 10m
  labels:
    severity: warning

# High duplicate request rate (potential retry storm)
# Both sides are aggregated because the two metrics have different label sets.
- alert: LedgerHighDuplicateRate
  expr: sum(rate(ledger_duplicate_requests_total[5m])) / sum(rate(ledger_journal_entries_total[5m])) > 0.2
  for: 5m
  labels:
    severity: warning
```

---

## Configuration

Metrics are configured in `config.yml`:

```yaml
metrics:
  address: ":9401"  # Metrics HTTP server address
```

## Dependencies

- Prometheus client library: `github.com/prometheus/client_golang`
- All metrics are registered globally and exposed via the `/metrics` endpoint
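
## Example Scrape Configuration

The queries and alerts in this document only return data once Prometheus scrapes the metrics endpoint. The fragment below is a minimal sketch of a `prometheus.yml` scrape job, not a configuration shipped with the service. It assumes the service runs on `localhost` with the default `:9401` address, and it names the job `ledger` so that the `up{job="ledger"}` alert matches. The 15s interval is an illustrative choice.

```yaml
scrape_configs:
  - job_name: ledger              # must match the job label used by the LedgerServiceDown alert
    metrics_path: /metrics
    scrape_interval: 15s          # assumption; pick a value that suits your alert windows
    static_configs:
      - targets: ["localhost:9401"]   # assumption: default metrics.address on the local host
```

With a 15s interval, each `[5m]` range in the queries above covers about 20 samples, which is comfortably more than the two samples `rate()` needs.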