# Ledger Service - Prometheus Metrics

## Overview

The Ledger service exposes Prometheus metrics on the metrics endpoint (default: `:9401/metrics`). This provides operational visibility into ledger operations, performance, and errors.

## Metrics Endpoint

- **URL**: `http://localhost:9401/metrics`
- **Format**: Prometheus exposition format
- **Configuration**: Set via `config.yml` → `metrics.address`

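For a quick smoke test without a Prometheus server, the endpoint can be read directly. The following is a minimal sketch in Python (used here only for illustration), assuming the default URL above; `fetch_metrics` and `parse_samples` are hypothetical helper names, and the parser handles only plain sample lines, not every corner of the exposition format.

```python
import urllib.request


def fetch_metrics(url="http://localhost:9401/metrics", timeout=5.0):
    """Download the raw exposition text from the metrics endpoint."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_samples(text, prefix="ledger_"):
    """Map each series (name plus labels) starting with `prefix` to its value."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "}" in line:
            # Labels may contain spaces, so split after the closing brace.
            end = line.rindex("}") + 1
            series, rest = line[:end], line[end:].split()
        else:
            series, *rest = line.split()
        # rest[0] is the value; an optional timestamp may follow it.
        if series.startswith(prefix) and rest:
            samples[series] = float(rest[0])
    return samples
```

Against a running service, `parse_samples(fetch_metrics())` would return only the `ledger_*` series and drop runtime metrics such as `go_goroutines`.
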
## Available Metrics

### 1. Journal Entry Operations

#### `ledger_journal_entries_total`
**Type**: Counter
**Description**: Total number of journal entries posted to the ledger
**Labels**:
- `entry_type`: Type of journal entry (`credit`, `debit`, `transfer`, `fx`, `fee`, `adjust`, `reverse`)
- `status`: Operation status (`success`, `error`, `attempted`)

**Example**:
```promql
# Count of successful credit entries
ledger_journal_entries_total{entry_type="credit", status="success"}

# Rate of failed transfers
rate(ledger_journal_entries_total{entry_type="transfer", status="error"}[5m])
```

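Counters such as `ledger_journal_entries_total` only ever increase and restart from zero when the process restarts, which is why the queries wrap them in `rate()`. The sketch below illustrates the idea in Python with hypothetical helper names. It is a simplification: Prometheus' real `rate()` also extrapolates to the window boundaries, which is omitted here.

```python
def counter_increase(samples):
    """Total increase over counter values listed in scrape order."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # A drop means the counter was reset (for example by a restart),
            # so everything counted since the reset is new.
            total += curr
    return total


def simple_rate(samples, window_seconds):
    """Average per-second increase over the window (no extrapolation)."""
    return counter_increase(samples) / window_seconds
```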
---

#### `ledger_journal_entry_duration_seconds`
**Type**: Histogram
**Description**: Duration of journal entry posting operations
**Labels**:
- `entry_type`: Type of journal entry

**Buckets**: `[.001, .005, .01, .025, .05, .1, .25, .5, 1, 2.5, 5, 10]` seconds

**Example**:
```promql
# 95th percentile latency for credit postings
histogram_quantile(0.95, rate(ledger_journal_entry_duration_seconds_bucket{entry_type="credit"}[5m]))

# Average duration, per entry type
rate(ledger_journal_entry_duration_seconds_sum[5m]) / rate(ledger_journal_entry_duration_seconds_count[5m])
```

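Because the histogram stores only cumulative bucket counts, `histogram_quantile()` returns an estimate: it locates the bucket that contains the requested rank and interpolates linearly inside it. The Python sketch below mirrors that behaviour for classic buckets under simplifying assumptions (sorted input, a final `+Inf` bucket, `0 < q <= 1`); the function is illustrative, not the Prometheus source.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with math.inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # The +Inf bucket always satisfies the condition, so the loop breaks.
    for i, (upper, cum) in enumerate(buckets):
        if cum >= rank:
            break
    if math.isinf(upper):
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower = buckets[i - 1][0] if i > 0 else 0.0
    below = buckets[i - 1][1] if i > 0 else 0.0
    # Linear interpolation between the bucket's lower and upper bound.
    return lower + (upper - lower) * (rank - below) / (cum - below)
```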
---

#### `ledger_journal_entry_errors_total`
**Type**: Counter
**Description**: Total number of journal entry posting errors
**Labels**:
- `entry_type`: Type of journal entry
- `error_type`: Error classification (`validation`, `insufficient_funds`, `db_error`, `not_implemented`, etc.)

**Example**:
```promql
# Errors by type
sum by (error_type) (ledger_journal_entry_errors_total)

# Validation error rate for transfers
rate(ledger_journal_entry_errors_total{entry_type="transfer", error_type="validation"}[5m])
```

---

### 2. Balance Operations

#### `ledger_balance_queries_total`
**Type**: Counter
**Description**: Total number of balance queries
**Labels**:
- `status`: Query status (`success`, `error`)

**Example**:
```promql
# Balance query success rate (aggregate both sides so the differing `status` labels do not prevent matching)
sum(rate(ledger_balance_queries_total{status="success"}[5m])) / sum(rate(ledger_balance_queries_total[5m]))
```

---

#### `ledger_balance_query_duration_seconds`
**Type**: Histogram
**Description**: Duration of balance query operations
**Labels**:
- `status`: Query status

**Example**:
```promql
# 99th percentile balance query latency
histogram_quantile(0.99, rate(ledger_balance_query_duration_seconds_bucket[5m]))
```

---

### 3. Reversal Operations

#### `ledger_reversals_total`
**Type**: Counter
**Description**: Total number of journal entry reversals
**Labels**:
- `status`: Reversal status (`success`, `error`)

**Example**:
```promql
# Reversal error rate
rate(ledger_reversals_total{status="error"}[5m])
```

---

### 4. Transaction Amounts

#### `ledger_transaction_amount`
**Type**: Histogram
**Description**: Distribution of transaction amounts (normalized)
**Labels**:
- `currency`: Currency code (`USD`, `EUR`, `GBP`, etc.)
- `entry_type`: Type of journal entry

**Buckets**: `[1, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000]`

**Example**:
```promql
# Average transaction amount in USD for credits
rate(ledger_transaction_amount_sum{currency="USD", entry_type="credit"}[5m]) /
rate(ledger_transaction_amount_count{currency="USD", entry_type="credit"}[5m])

# 90th percentile transaction amount (one result per currency and entry type)
histogram_quantile(0.90, rate(ledger_transaction_amount_bucket[5m]))
```

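On the `/metrics` endpoint, histogram buckets are cumulative: a single observation increments the first bucket whose `le` bound is greater than or equal to the value, and every larger bucket including `+Inf`. The Python sketch below models that exposed form for the bucket bounds listed above; `observe` is a hypothetical helper and says nothing about how the client library stores counts internally.

```python
import bisect

# Upper bounds from the bucket list above; the +Inf bucket is implicit.
BUCKETS = [1, 10, 50, 100, 500, 1000, 5000, 10000, 50000, 100000]


def observe(counts, amount, bounds=BUCKETS):
    """Record one amount in cumulative counts (len(bounds) + 1 entries)."""
    # bisect_left returns the index of the first bound >= amount,
    # matching the "less than or equal" meaning of the `le` label.
    first = bisect.bisect_left(bounds, amount)
    for i in range(first, len(counts)):
        counts[i] += 1


counts = [0] * (len(BUCKETS) + 1)  # last slot is the +Inf bucket
for amount in (75, 100, 250000):
    observe(counts, amount)
# 75 and 100 both land in le="100"; 250000 only reaches +Inf.
print(counts)  # [0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 3]
```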
---

### 5. Account Operations

#### `ledger_account_operations_total`
**Type**: Counter
**Description**: Total number of account-level operations
**Labels**:
- `operation`: Operation type (`create`, `freeze`, `unfreeze`)
- `status`: Operation status (`success`, `error`)

**Example**:
```promql
# Account creation rate
rate(ledger_account_operations_total{operation="create"}[5m])
```

---

### 6. Idempotency

#### `ledger_duplicate_requests_total`
**Type**: Counter
**Description**: Total number of duplicate requests detected via idempotency keys
**Labels**:
- `entry_type`: Type of journal entry

**Example**:
```promql
# Duplicate request rate (indicates retry behavior)
rate(ledger_duplicate_requests_total[5m])

# Percentage of duplicate requests (aggregated, because the two metrics carry different label sets)
sum(rate(ledger_duplicate_requests_total[5m])) / sum(rate(ledger_journal_entries_total[5m])) * 100
```

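A point worth keeping in mind when dividing this metric by `ledger_journal_entries_total`: PromQL pairs series one-to-one on identical label sets, and the two metrics differ (`entry_type` versus `entry_type` plus `status`), so a ratio needs both sides aggregated first. The Python sketch below imitates that matching rule with made-up sample values; `divide_one_to_one` and `label_set` are hypothetical helpers.

```python
def divide_one_to_one(lhs, rhs):
    """Pair series only when their label sets are identical (PromQL default)."""
    return {key: value / rhs[key] for key, value in lhs.items() if key in rhs}


def label_set(**labels):
    """Build a hashable label set such as {entry_type="credit"}."""
    return frozenset(labels.items())


# Made-up per-second rates, for illustration only.
duplicates = {label_set(entry_type="credit"): 2.0}
entries = {
    label_set(entry_type="credit", status="success"): 38.0,
    label_set(entry_type="credit", status="error"): 2.0,
}

# No label set appears on both sides, so nothing is paired.
unmatched = divide_one_to_one(duplicates, entries)

# Aggregating first removes the labels and yields a single ratio.
percentage = sum(duplicates.values()) / sum(entries.values()) * 100
```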
---

### 7. gRPC Metrics (Built-in)

These metrics are provided automatically by the gRPC framework:

#### `grpc_server_requests_total`
**Type**: Counter
**Labels**: `grpc_service`, `grpc_method`, `grpc_type`, `grpc_code`

#### `grpc_server_handling_seconds`
**Type**: Histogram
**Labels**: `grpc_service`, `grpc_method`, `grpc_type`, `grpc_code`

**Example**:
```promql
# gRPC error rate by method
sum by (grpc_method) (rate(grpc_server_requests_total{grpc_code!="OK"}[5m]))

# P95 latency for the PostCreditWithCharges RPC
histogram_quantile(0.95, rate(grpc_server_handling_seconds_bucket{grpc_method="PostCreditWithCharges"}[5m]))
```

---

## Common Queries

### Health & Availability

```promql
# Overall request rate
sum(rate(grpc_server_requests_total[5m]))

# Error rate (all journal entry operations)
sum(rate(ledger_journal_entry_errors_total[5m]))

# Success rate for journal entries
sum(rate(ledger_journal_entries_total{status="success"}[5m])) / sum(rate(ledger_journal_entries_total[5m]))
```

### Performance

```promql
# P99 latency for all journal entry types
histogram_quantile(0.99, sum(rate(ledger_journal_entry_duration_seconds_bucket[5m])) by (le, entry_type))

# Slowest operation types
topk(5, avg by (entry_type) (rate(ledger_journal_entry_duration_seconds_sum[5m]) / rate(ledger_journal_entry_duration_seconds_count[5m])))
```

### Business Insights

```promql
# Transaction volume by type
sum by (entry_type) (rate(ledger_journal_entries_total{status="success"}[1h]))

# Total money flow per currency (amounts in different currencies are not added together)
sum by (currency) (rate(ledger_transaction_amount_sum[5m]))

# Most common error types
topk(10, sum by (error_type) (rate(ledger_journal_entry_errors_total[5m])))
```

---

## Grafana Dashboard

### Recommended Panels

1. **Request Rate** - `sum(rate(grpc_server_requests_total[5m]))`
2. **Error Rate** - `sum(rate(grpc_server_requests_total{grpc_code!="OK"}[5m]))`
3. **P95/P99 Latency** - Histogram quantiles
4. **Operations by Type** - Stacked graph of `ledger_journal_entries_total`
5. **Error Breakdown** - Pie chart of `ledger_journal_entry_errors_total` by `error_type`
6. **Transaction Volume** - Counter of successful entries
7. **Duplicate Requests** - `ledger_duplicate_requests_total` rate

---

## Alerting Rules

### Critical

```yaml
# High error rate (threshold is errors per second, evaluated per entry_type/error_type series)
- alert: LedgerHighErrorRate
  expr: rate(ledger_journal_entry_errors_total[5m]) > 10
  for: 5m
  labels:
    severity: critical

# Service unavailable
- alert: LedgerServiceDown
  expr: up{job="ledger"} == 0
  for: 1m
  labels:
    severity: critical
```

### Warning

```yaml
# Slow operations
- alert: LedgerSlowOperations
  expr: histogram_quantile(0.95, rate(ledger_journal_entry_duration_seconds_bucket[5m])) > 1
  for: 10m
  labels:
    severity: warning

# High duplicate request rate (potential retry storm)
- alert: LedgerHighDuplicateRate
  expr: sum(rate(ledger_duplicate_requests_total[5m])) / sum(rate(ledger_journal_entries_total[5m])) > 0.2
  for: 5m
  labels:
    severity: warning
```

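The `for:` field in these rules means the expression must stay true across consecutive evaluations for the whole duration before the alert fires; a single false evaluation resets the timer. The Python sketch below models that state machine under simplifying assumptions (evenly spaced evaluations, one series); `alert_states` is a hypothetical helper, not part of Prometheus.

```python
def alert_states(condition_by_eval, eval_interval_s, for_s):
    """Return the alert state after each rule evaluation."""
    states = []
    active_since = None  # time at which the condition last became true
    for i, is_true in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not is_true:
            # Any false evaluation clears the pending timer.
            active_since = None
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# One evaluation per minute with `for: 2m`; the dip at minute 2 restarts the wait.
history = alert_states([True, True, False, True, True, True], 60, 120)
print(history)
# ['pending', 'pending', 'inactive', 'pending', 'pending', 'firing']
```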
---

## Configuration

Metrics are configured in `config.yml`:

```yaml
metrics:
  address: ":9401"  # Metrics HTTP server address
```

## Dependencies

- Prometheus client library: `github.com/prometheus/client_golang`
- All metrics are registered globally and exposed via the `/metrics` endpoint