# ADR 013: Observability
## Changelog
- 20-01-2020: Initial Draft
## Status
Proposed
## Context
Telemetry is paramount to debugging and understanding what the application is doing and how it is
performing. We aim to expose metrics from modules and other core parts of the Cosmos SDK.

In addition, we should aim to support multiple configurable sinks that an operator may choose from.
By default, when telemetry is enabled, the application should track and expose metrics that are
stored in-memory. The operator may choose to enable additional sinks, where we support only
[Prometheus](https://prometheus.io/) for now, as it's battle-tested, simple to set up, open source,
and rich with ecosystem tooling.

We must also aim to integrate metrics into the Cosmos SDK as seamlessly as possible, such that
metrics may be added or removed at will and without much friction. To do this, we will use the
[go-metrics](https://github.com/armon/go-metrics) library.

Finally, operators may enable telemetry along with specific configuration options. If enabled, metrics
will be exposed via `/metrics?format={text|prometheus}` on the API server.
## Decision
We will add an additional configuration block to `app.toml` that defines telemetry settings:
```toml
###############################################################################
### Telemetry Configuration ###
###############################################################################
[telemetry]
# Prefixed with keys to separate services
service-name = {{ .Telemetry.ServiceName }}
# Enabled enables the application telemetry functionality. When enabled,
# an in-memory sink is also enabled by default. Operators may also enable
# other sinks such as Prometheus.
enabled = {{ .Telemetry.Enabled }}
# Enable prefixing gauge values with hostname
enable-hostname = {{ .Telemetry.EnableHostname }}
# Enable adding hostname to labels
enable-hostname-label = {{ .Telemetry.EnableHostnameLabel }}
# Enable adding service to labels
enable-service-label = {{ .Telemetry.EnableServiceLabel }}
# PrometheusRetentionTime, when positive, enables a Prometheus metrics sink.
prometheus-retention-time = {{ .Telemetry.PrometheusRetentionTime }}
```
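To illustrate how these settings might map onto sinks, here is a minimal sketch. The struct field names are taken from the template above, but their Go types, the `EnabledSinks` helper, and the sink labels are assumptions made for illustration and are not part of this ADR. Note that Prometheus has no dedicated on/off flag: a positive retention time is what switches it on.

```go
package main

import "fmt"

// TelemetryConfig mirrors the template fields used in app.toml.
// The Go types are assumptions; the ADR only names the fields.
type TelemetryConfig struct {
	ServiceName             string
	Enabled                 bool
	EnableHostname          bool
	EnableHostnameLabel     bool
	EnableServiceLabel      bool
	PrometheusRetentionTime int64 // assumed unit: seconds
}

// EnabledSinks is a hypothetical helper that reports which sinks a
// configuration turns on: none when telemetry is disabled, the in-memory
// sink whenever it is enabled, and Prometheus only when the retention
// time is positive.
func EnabledSinks(cfg TelemetryConfig) []string {
	if !cfg.Enabled {
		return nil
	}

	sinks := []string{"in-memory"}
	if cfg.PrometheusRetentionTime > 0 {
		sinks = append(sinks, "prometheus")
	}

	return sinks
}

func main() {
	cfg := TelemetryConfig{
		ServiceName:             "myapp",
		Enabled:                 true,
		PrometheusRetentionTime: 60,
	}

	fmt.Println(EnabledSinks(cfg)) // prints "[in-memory prometheus]"
}
```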
The given configuration allows for two sinks -- in-memory and Prometheus. We create a `Metrics`
type that performs all the bootstrapping for the operator, so capturing metrics becomes seamless.
```go
// Metrics defines a wrapper around application telemetry functionality. It allows
// metrics to be gathered at any point in time. When creating a Metrics object,
// internally, a global metrics is registered with a set of sinks as configured
// by the operator. In addition to the sinks, when a process gets a SIGUSR1, a
// dump of formatted recent metrics will be sent to STDERR.
type Metrics struct {
	memSink           *metrics.InmemSink
	prometheusEnabled bool
}

// Gather collects all registered metrics and returns a GatherResponse where the
// metrics are encoded depending on the type. Metrics are either encoded via
// Prometheus or JSON if in-memory.
func (m *Metrics) Gather(format string) (GatherResponse, error) {
	switch format {
	case FormatPrometheus:
		return m.gatherPrometheus()

	case FormatText:
		return m.gatherGeneric()

	case FormatDefault:
		return m.gatherGeneric()

	default:
		return GatherResponse{}, fmt.Errorf("unsupported metrics format: %s", format)
	}
}
```
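The snippet above relies on format constants and a `GatherResponse` type that this ADR does not spell out. The following self-contained sketch shows one plausible shape for them. The constant values, the content types, and the `stubMetrics` type are assumptions; only the `Metrics` and `ContentType` field names are implied by how the response is used later in this document. The stub encodes a plain map instead of reading from real sinks, and its Prometheus branch emits simplified `name value` lines rather than the full exposition format.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"strings"
)

// Format values are assumptions; the ADR names the constants but not their values.
const (
	FormatDefault    = ""
	FormatPrometheus = "prometheus"
	FormatText       = "text"
)

// GatherResponse carries the encoded metrics and the content type to serve them with.
type GatherResponse struct {
	Metrics     []byte
	ContentType string
}

// stubMetrics is a hypothetical stand-in for Metrics that holds gauges in a map.
type stubMetrics struct {
	gauges map[string]float64
}

// Gather encodes the gauges as simplified "name value" lines for the
// Prometheus format and as JSON for the text and default formats.
func (m stubMetrics) Gather(format string) (GatherResponse, error) {
	switch format {
	case FormatPrometheus:
		names := make([]string, 0, len(m.gauges))
		for name := range m.gauges {
			names = append(names, name)
		}
		sort.Strings(names) // deterministic output order

		var sb strings.Builder
		for _, name := range names {
			fmt.Fprintf(&sb, "%s %g\n", name, m.gauges[name])
		}

		return GatherResponse{
			Metrics:     []byte(sb.String()),
			ContentType: "text/plain; version=0.0.4", // assumed content type
		}, nil

	case FormatText, FormatDefault:
		bz, err := json.Marshal(m.gauges)
		if err != nil {
			return GatherResponse{}, err
		}

		return GatherResponse{Metrics: bz, ContentType: "application/json"}, nil

	default:
		return GatherResponse{}, fmt.Errorf("unsupported metrics format: %s", format)
	}
}

func main() {
	m := stubMetrics{gauges: map[string]float64{"tx_count": 3, "block_height": 42}}

	for _, format := range []string{FormatPrometheus, FormatText, "xml"} {
		gr, err := m.Gather(format)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}

		fmt.Printf("%s\n%s\n", gr.ContentType, gr.Metrics)
	}
}
```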
In addition, `Metrics` allows us to gather the current set of metrics at any given point in time. An
operator may also choose to send a signal, SIGUSR1, to dump and print formatted metrics to STDERR.

During an application's bootstrapping and construction phase, if `Telemetry.Enabled` is `true`, the
API server will create a `Metrics` instance and register a metrics handler accordingly.
```go
func (s *Server) Start(cfg config.Config) error {
	// ...

	if cfg.Telemetry.Enabled {
		m, err := telemetry.New(cfg.Telemetry)
		if err != nil {
			return err
		}

		s.metrics = m
		s.registerMetrics()
	}

	// ...
}

func (s *Server) registerMetrics() {
	metricsHandler := func(w http.ResponseWriter, r *http.Request) {
		format := strings.TrimSpace(r.FormValue("format"))

		gr, err := s.metrics.Gather(format)
		if err != nil {
			rest.WriteErrorResponse(w, http.StatusBadRequest, fmt.Sprintf("failed to gather metrics: %s", err))
			return
		}

		w.Header().Set("Content-Type", gr.ContentType)
		_, _ = w.Write(gr.Metrics)
	}

	s.Router.HandleFunc("/metrics", metricsHandler).Methods("GET")
}
```
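To show how such a handler could be exercised without starting a node, here is a minimal sketch using only the standard library. The `gatherer` interface, the `fakeGatherer` type, and the `statusFor` helper are hypothetical names introduced for this example. The sketch also uses `http.Error` in place of `rest.WriteErrorResponse` and checks the request method by hand instead of going through the router's `Methods("GET")`, so it only approximates the real wiring.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

// gatherer is a hypothetical interface standing in for *Metrics.
type gatherer interface {
	Gather(format string) (body []byte, contentType string, err error)
}

// fakeGatherer returns canned payloads so the handler can be tested in isolation.
type fakeGatherer struct{}

func (fakeGatherer) Gather(format string) ([]byte, string, error) {
	switch format {
	case "", "text":
		return []byte(`{"tx_count":3}`), "application/json", nil
	case "prometheus":
		return []byte("tx_count 3\n"), "text/plain", nil
	default:
		return nil, "", fmt.Errorf("unsupported metrics format: %s", format)
	}
}

// metricsHandler mirrors the flow of the ADR's handler: read the format
// parameter, gather, and answer 400 Bad Request on failure.
func metricsHandler(g gatherer) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodGet {
			http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
			return
		}

		format := strings.TrimSpace(r.FormValue("format"))

		body, contentType, err := g.Gather(format)
		if err != nil {
			http.Error(w, fmt.Sprintf("failed to gather metrics: %s", err), http.StatusBadRequest)
			return
		}

		w.Header().Set("Content-Type", contentType)
		_, _ = w.Write(body)
	}
}

// statusFor issues an in-process GET against the handler and returns the
// response status code and body.
func statusFor(target string) (int, string) {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, target, nil)
	metricsHandler(fakeGatherer{}).ServeHTTP(rec, req)

	return rec.Code, rec.Body.String()
}

func main() {
	code, body := statusFor("/metrics?format=prometheus")
	fmt.Println(code, strings.TrimSpace(body))

	code, _ = statusFor("/metrics?format=xml")
	fmt.Println(code)
}
```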
Application developers may track counters, gauges, summaries, and key/value metrics. Modules need no
additional heavy lifting to leverage profiling metrics. Doing so is as simple as:
```go
func (k BaseKeeper) MintCoins(ctx sdk.Context, moduleName string, amt sdk.Coins) error {
	defer metrics.MeasureSince(time.Now(), "MintCoins")

	// ...
}
```
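The one-liner works because Go evaluates the arguments of a deferred call when the `defer` statement executes, while the call itself runs when the function returns. The sketch below demonstrates that timing with a stand-in helper. `measureSince`, `recorded`, `mintCoins`, and `sampleMillis` are hypothetical names that only mirror the argument order shown above; they are not the go-metrics API.

```go
package main

import (
	"fmt"
	"time"
)

// recorded holds the latest sample per key; a hypothetical stand-in for a metrics sink.
var recorded = map[string]time.Duration{}

// measureSince stores the time elapsed since start under the given key.
func measureSince(start time.Time, key string) {
	recorded[key] = time.Since(start)
}

// mintCoins simulates a keeper method instrumented with a deferred timer.
func mintCoins() {
	// time.Now() is evaluated here, on entry; measureSince itself runs
	// only when mintCoins returns, so the sample covers the whole body.
	defer measureSince(time.Now(), "MintCoins")

	time.Sleep(10 * time.Millisecond) // placeholder for real work
}

// sampleMillis returns the recorded duration for key in whole milliseconds.
func sampleMillis(key string) int64 {
	return recorded[key].Milliseconds()
}

func main() {
	mintCoins()
	fmt.Println(sampleMillis("MintCoins") >= 10) // prints "true"
}
```

Because the start time is captured on entry, the recorded sample also includes any early-return paths, which is what makes the pattern safe to drop into existing functions.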
## Consequences
### Positive
- Exposure into the performance and behavior of an application
### Negative
### Neutral
## References