Fix and improve quota monitor blueprint (#1488)

* quota monitoring blueprint fixes

* wip

* wip

* quota

* improvements

* improve variables

* refactor http code

* fix http post

* improve logging

* fix project creation, improve readme

* fix test

* Update main.py

* remove unneeded constant

* exit with http error message instead of json when failing to decode api response

* actually do what previous commit wanted :)

* nits
Ludovico Magnocavallo 2023-07-03 09:23:49 +02:00 committed by GitHub
parent 86cc6eee4c
commit 0bc6dffce0
9 changed files with 449 additions and 306 deletions


@ -1,38 +1,55 @@
# Compute Engine quota monitoring
This blueprint improves on the [GCE quota exporter tool](https://github.com/GoogleCloudPlatform/professional-services/tree/master/tools/gce-quota-sync) (by the same author of this blueprint), and shows a practical way of collecting and monitoring [Compute Engine resource quotas](https://cloud.google.com/compute/quotas) via Cloud Monitoring metrics as an alternative to the [built-in quota metrics](https://cloud.google.com/monitoring/alerts/using-quota-metrics).
Compared to the built-in metrics, it offers a simpler representation of quotas and quota ratios which is especially useful in charts, allows filtering or combining quotas between different projects regardless of their monitoring workspace, and optionally creates alerting policies without the need to interact directly with the monitoring API.
Regardless of its specific purpose, this blueprint is also useful in showing how to manipulate and write time series to Cloud Monitoring. The resources it creates are shown in the high-level diagram below:
<img src="diagram.png" width="640px" alt="GCP resource diagram">
The Cloud Function arguments that control function execution (for example to set which project quotas to monitor) are defined in the Cloud Scheduler payload sent in the PubSub message, so that a single function can be used for different configurations by creating more schedules.
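As a sketch of that contract, the payload is just base64-encoded JSON merging the monitoring project with the quota configuration; the field names mirror the Terraform and Python code in this blueprint, while the values below are purely illustrative:

```python
import base64
import json

# Illustrative payload, mirroring what main.tf builds with
# base64encode(jsonencode(merge({ monitoring_project = ... }, var.quota_config))).
payload = {
    'monitoring_project': 'my-project-id',  # project metrics are written to
    'projects': ['my-project-id'],          # projects to track quotas for
    'regions': ['global', 'europe-west1'],
    'include': [],
    'exclude': ['a2', 'nvidia'],
    'dry_run': True,
    'verbose': False,
}
# This is the shape of the PubSub message the function receives.
event = {'data': base64.b64encode(json.dumps(payload).encode('utf-8'))}

# The Cloud Function entry point decodes it back before calling _main(**data).
data = json.loads(base64.b64decode(event['data']).decode('utf-8'))
print(data['monitoring_project'])  # -> my-project-id
```

Creating more Cloud Scheduler jobs with different payloads reuses the same function for different project/region sets.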
Quota time series are stored using [custom metrics](https://cloud.google.com/monitoring/custom-metrics) with different metric types for usage, limit and utilization; metric types are based on a common prefix defaulting to `quota` and two tokens representing the quota name and type of data. This is an example:

- `custom.googleapis.com/quota/firewalls/usage`
- `custom.googleapis.com/quota/firewalls/limit`
- `custom.googleapis.com/quota/firewalls/ratio`

All custom metrics are associated to the `global` resource type and use [gauge kind](https://cloud.google.com/monitoring/api/v3/kinds-and-types#metric-kinds).

Metric labels contain:

- `project` set to the project of the quota
- `location` set to the region of the quota (or `global` for project-level quotas)
- `quota` containing the string representation of `usage / limit` for the quota, to provide an immediate reference when checking ratios; this can be easily turned off in code if reducing cardinality is needed
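A minimal sketch of how one quota turns into metric types, labels and a ratio under this naming scheme (the quota name and values are made up; `BASE` matches the prefix used in the blueprint's code):

```python
# Compose the metric types and labels for one quota, following the
# custom.googleapis.com/quota/<name>/<kind> naming scheme described above.
BASE = 'custom.googleapis.com/quota'

quota = {'metric': 'FIREWALLS', 'usage': 5, 'limit': 100}  # illustrative values

metric_types = [f'{BASE}/{quota["metric"].lower()}/{kind}'
                for kind in ('usage', 'limit', 'ratio')]
labels = {
    'project': 'my-project-id',  # project owning the quota
    'location': 'global',        # region, or "global" for project quotas
    'quota': f'{quota["usage"]}/{quota["limit"]}',  # usage/limit reference
}
# Guard against zero limits when computing the ratio.
ratio = quota['usage'] / quota['limit'] if quota['limit'] else 0

print(metric_types[0])  # -> custom.googleapis.com/quota/firewalls/usage
```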
Labels are set with the project id (which may differ from the monitoring workspace projects) and region (quotas that are not region-specific are labelled `global`); this is how the `ratio` metric for a quota looks in Metrics Explorer:
<img src="explorer.png" width="640px" alt="GCP Metrics Explorer, usage, limit and utilization view sample">
## Configuring resources
The project where resources are created is also the one where metrics will be written, and is configured via the `project_id` variable. The project can optionally be created by configuring the `project_create_config` variable.
The region, location of the bundle used to deploy the function, and scheduling frequency can also be configured via the relevant variables.
## Configuring Cloud Function parameters
The `quota_config` variable mirrors the arguments accepted by the Python program, and allows configuring several different aspects of its behaviour:
- `quota_config.exclude` do not generate metrics for quotas matching prefixes listed here
- `quota_config.include` only generate metrics for quotas matching prefixes listed here
- `quota_config.projects` projects to track quotas for, defaults to the project where metrics are stored
- `quota_config.regions` regions to track quotas for, defaults to the `global` region for project-level quotas
- `dry_run` do not write actual metrics
- `verbose` increase logging verbosity
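The include/exclude lists are prefix matches against lowercased quota names; a sketch of the selection logic used by the function (the quota names below are illustrative):

```python
def selected(metric, include=None, exclude=None):
  'Return True if a quota name passes the include/exclude prefix filters.'
  metric = metric.lower()
  if include and not any(metric.startswith(k) for k in include):
    return False
  if exclude and any(metric.startswith(k) for k in exclude):
    return False
  return True

quotas = ['CPUS', 'N2_CPUS', 'NVIDIA_T4_GPUS', 'FIREWALLS']
kept = [q for q in quotas if selected(q, exclude=['nvidia'])]
print(kept)  # -> ['CPUS', 'N2_CPUS', 'FIREWALLS']
```

An empty include list matches everything, which is why the default configuration only sets excludes.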
The solution can also create basic monitoring alert policies, to demonstrate how to raise alerts when quota utilization goes over a predefined threshold. To enable them, configure the `alert_configs` variable and reapply main.tf after main.py has run at least once and quota monitoring metrics have been created.
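For example, a hypothetical tfvars snippet based on the `alert_configs` variable, alerting when the `cpus` ratio metric exceeds 80% (the key and values are illustrative; keys are appended to the `custom.googleapis.com/quota/` prefix in the alert filter):

```hcl
alert_configs = {
  "cpus/ratio" = {
    enabled       = true
    threshold     = 0.8
    documentation = "CPU quota ratio over 80%."
    labels        = { severity = "warning" }
  }
}
```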
## Running the blueprint
Clone this repository or [open it in cloud shell](https://ssh.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https%3A%2F%2Fgithub.com%2FGoogleCloudPlatform%2Fcloud-foundation-fabric&cloudshell_print=cloud-shell-readme.txt&cloudshell_working_dir=blueprints%2Fcloud-operations%2Fquota-monitoring), then go through the following steps to create resources:
- `terraform init`
- `terraform apply -var project_id=my-project-id`
@ -42,25 +59,26 @@ Clone this repository or [open it in cloud shell](https://ssh.cloud.google.com/c
| name | description | type | required | default |
|---|---|:---:|:---:|:---:|
| [project_id](variables.tf#L54) | Project id that references existing project. | <code>string</code> | ✓ | |
| [alert_configs](variables.tf#L17) | Configure creation of monitoring alerts for specific quotas. Keys match quota names. | <code title="map&#40;object&#40;&#123;&#10; documentation &#61; optional&#40;string&#41;&#10; enabled &#61; optional&#40;bool&#41;&#10; labels &#61; optional&#40;map&#40;string&#41;&#41;&#10; threshold &#61; optional&#40;number, 0.75&#41;&#10;&#125;&#41;&#41;">map&#40;object&#40;&#123;&#8230;&#125;&#41;&#41;</code> | | <code>&#123;&#125;</code> |
| [bundle_path](variables.tf#L33) | Path used to write the intermediate Cloud Function code bundle. | <code>string</code> | | <code>&#34;.&#47;bundle.zip&#34;</code> |
| [name](variables.tf#L39) | Arbitrary string used to name created resources. | <code>string</code> | | <code>&#34;quota-monitor&#34;</code> |
| [project_create_config](variables.tf#L45) | Create project instead of using an existing one. | <code title="object&#40;&#123;&#10; billing_account &#61; string&#10; parent &#61; optional&#40;string&#41;&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | | <code>null</code> |
| [quota_config](variables.tf#L59) | Cloud function configuration. | <code title="object&#40;&#123;&#10; exclude &#61; optional&#40;list&#40;string&#41;, &#91;&#10; &#34;a2&#34;, &#34;c2&#34;, &#34;c2d&#34;, &#34;committed&#34;, &#34;g2&#34;, &#34;interconnect&#34;, &#34;m1&#34;, &#34;m2&#34;, &#34;m3&#34;,&#10; &#34;nvidia&#34;, &#34;preemptible&#34;&#10; &#93;&#41;&#10; include &#61; optional&#40;list&#40;string&#41;&#41;&#10; projects &#61; optional&#40;list&#40;string&#41;&#41;&#10; regions &#61; optional&#40;list&#40;string&#41;&#41;&#10; dry_run &#61; optional&#40;bool, false&#41;&#10; verbose &#61; optional&#40;bool, false&#41;&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | | <code>&#123;&#125;</code> |
| [region](variables.tf#L76) | Compute region used in the example. | <code>string</code> | | <code>&#34;europe-west1&#34;</code> |
| [schedule_config](variables.tf#L82) | Schedule timer configuration in crontab format. | <code>string</code> | | <code>&#34;0 &#42; &#42; &#42; &#42;&#34;</code> |
<!-- END TFDOC -->
## Test
```hcl
module "test" {
  source     = "./fabric/blueprints/cloud-operations/quota-monitoring"
  name       = "name"
  project_id = "test"
  project_create_config = {
    billing_account = "12345-ABCDE-12345"
  }
}
# tftest modules=4 resources=14
```


@ -1,226 +0,0 @@
#! /usr/bin/env python3
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Sync GCE quota usage to Stackdriver for multiple projects.

This tool fetches global and/or regional quotas from the GCE API for
multiple projects, and sends them to Stackdriver as custom metrics, where they
can be used to set alert policies or create charts.
"""

import base64
import datetime
import json
import logging
import os
import time
import warnings

import click

from google.api_core.exceptions import GoogleAPIError
from google.api import label_pb2 as ga_label
from google.api import metric_pb2 as ga_metric
from google.cloud import monitoring_v3
import googleapiclient.discovery
import googleapiclient.errors

_BATCH_SIZE = 5
_METRIC_KIND = ga_metric.MetricDescriptor.MetricKind.GAUGE
_METRIC_TYPE_STEM = 'custom.googleapis.com/quota/'
_USAGE = "usage"
_LIMIT = "limit"
_UTILIZATION = "utilization"


def _add_series(project_id, series, client=None):
  """Write metrics series to Stackdriver.

  Args:
    project_id: series will be written to this project id's account
    series: the time series to be written, as a list of
        monitoring_v3.types.TimeSeries instances
    client: optional monitoring_v3.MetricServiceClient will be used
        instead of obtaining a new one
  """
  client = client or monitoring_v3.MetricServiceClient()
  project_name = client.common_project_path(project_id)
  if isinstance(series, monitoring_v3.types.TimeSeries):
    series = [series]
  try:
    client.create_time_series(name=project_name, time_series=series)
  except GoogleAPIError as e:
    raise RuntimeError('Error from monitoring API: %s' % e)


def _configure_logging(verbose=True):
  """Basic logging configuration.

  Args:
    verbose: enable verbose logging
  """
  level = logging.DEBUG if verbose else logging.INFO
  logging.basicConfig(level=level)
  warnings.filterwarnings('ignore', r'.*end user credentials.*', UserWarning)


def _fetch_quotas(project, region='global', compute=None):
  """Fetch GCE per-project or per-region quotas from the API.

  Args:
    project: fetch global or regional quotas for this project id
    region: which quotas to fetch, 'global' or region name
    compute: optional instance of googleapiclient.discovery.build will be used
        instead of obtaining a new one
  """
  compute = compute or googleapiclient.discovery.build('compute', 'v1')
  try:
    if region != 'global':
      req = compute.regions().get(project=project, region=region)
    else:
      req = compute.projects().get(project=project)
    resp = req.execute()
    return resp['quotas']
  except (GoogleAPIError, googleapiclient.errors.HttpError) as e:
    logging.debug('API Error: %s', e, exc_info=True)
    raise RuntimeError('Error fetching quota (project: %s, region: %s)' %
                       (project, region))


def _get_series(metric_labels, value, metric_type, timestamp, dt=None):
  """Create a Stackdriver monitoring time series from value and labels.

  Args:
    metric_labels: dict with labels that will be used in the time series
    value: time series value
    metric_type: which metric is this series for
    dt: datetime.datetime instance used for the series end time
  """
  series = monitoring_v3.types.TimeSeries()
  series.metric.type = metric_type
  series.resource.type = 'global'
  for label in metric_labels:
    series.metric.labels[label] = metric_labels[label]
  point = monitoring_v3.types.Point()
  point.value.double_value = value
  seconds = int(timestamp)
  nanos = int((timestamp - seconds) * 10**9)
  interval = monitoring_v3.TimeInterval(
      {"end_time": {
          "seconds": seconds,
          "nanos": nanos
      }})
  point.interval = interval
  series.points.append(point)
  return series


def _quota_to_series_triplet(project, region, quota):
  """Convert API quota objects to three Stackdriver monitoring time series:
  usage, limit and utilization.

  Args:
    project: set in converted time series labels
    region: set in converted time series labels
    quota: quota object received from the GCE API
  """
  labels = dict()
  labels['project'] = project
  labels['region'] = region
  try:
    utilization = quota['usage'] / float(quota['limit'])
  except ZeroDivisionError:
    utilization = 0
  now = time.time()
  metric_type_prefix = _METRIC_TYPE_STEM + quota['metric'].lower() + '_'
  return [
      _get_series(labels, quota['usage'], metric_type_prefix + _USAGE, now),
      _get_series(labels, quota['limit'], metric_type_prefix + _LIMIT, now),
      _get_series(labels, utilization, metric_type_prefix + _UTILIZATION, now),
  ]


@click.command()
@click.option('--monitoring-project', required=True,
              help='monitoring project id')
@click.option('--gce-project', multiple=True,
              help='project ids (multiple), defaults to monitoring project')
@click.option('--gce-region', multiple=True,
              help='regions (multiple), defaults to "global"')
@click.option('--verbose', is_flag=True, help='Verbose output')
@click.argument('keywords', nargs=-1)
def main_cli(monitoring_project=None, gce_project=None, gce_region=None,
             verbose=False, keywords=None):
  """Fetch GCE quotas and writes them as custom metrics to Stackdriver.

  If KEYWORDS are specified as arguments, only quotas matching one of the
  keywords will be stored in Stackdriver.
  """
  try:
    _main(monitoring_project, gce_project, gce_region, verbose, keywords)
  except RuntimeError:
    logging.exception('exception raised')


def main(event, context):
  """Cloud Function entry point."""
  try:
    data = json.loads(base64.b64decode(event['data']).decode('utf-8'))
    _main(os.environ.get('GCP_PROJECT'), **data)
  # uncomment once https://issuetracker.google.com/issues/155215191 is fixed
  # except RuntimeError:
  #   raise
  except Exception:
    logging.exception('exception in cloud function entry point')


def _main(monitoring_project, gce_project=None, gce_region=None, verbose=False,
          keywords=None):
  """Module entry point used by cli and cloud function wrappers."""
  _configure_logging(verbose=verbose)
  gce_projects = gce_project or [monitoring_project]
  gce_regions = gce_region or ['global']
  keywords = set(keywords or [])
  logging.debug('monitoring project %s', monitoring_project)
  logging.debug('projects %s regions %s', gce_projects, gce_regions)
  logging.debug('keywords %s', keywords)
  quotas = []
  compute = googleapiclient.discovery.build('compute', 'v1',
                                            cache_discovery=False)
  for project in gce_projects:
    logging.debug('project %s', project)
    for region in gce_regions:
      logging.debug('region %s', region)
      for quota in _fetch_quotas(project, region, compute=compute):
        if keywords and not any(k in quota['metric'] for k in keywords):
          # logging.debug('skipping %s', quota)
          continue
        logging.debug('quota %s', quota)
        quotas.append((project, region, quota))
  client, i = monitoring_v3.MetricServiceClient(), 0
  x = len(quotas)
  while i < len(quotas):
    series = sum(
        [_quota_to_series_triplet(*q) for q in quotas[i:i + _BATCH_SIZE]], [])
    _add_series(monitoring_project, series, client)
    i += _BATCH_SIZE


if __name__ == '__main__':
  main_cli()


@ -1,3 +0,0 @@
Click>=7.0
google-api-python-client>=1.10.1
google-cloud-monitoring>=1.1.0

Binary file not shown.



@ -23,16 +23,15 @@ locals {
}
module "project" {
  source          = "../../../modules/project"
  name            = var.project_id
  billing_account = try(var.project_create_config.billing_account, null)
  parent          = try(var.project_create_config.parent, null)
  project_create  = var.project_create_config != null
  services = [
    "compute.googleapis.com",
    "cloudfunctions.googleapis.com"
  ]
}
module "pubsub" {
@ -56,15 +55,9 @@ module "cf" {
    location = var.region
  }
  bundle_config = {
    source_dir  = "${path.module}/src"
    output_path = var.bundle_path
  }
  service_account_create = true
  trigger_config = {
    event = "google.pubsub.topic.publish"
@ -72,24 +65,28 @@ module "cf" {
}
}
resource "google_cloud_scheduler_job" "default" {
  project   = module.project.project_id
  region    = var.region
  name      = var.name
  schedule  = var.schedule_config
  time_zone = "UTC"
  pubsub_target {
    attributes = {}
    topic_name = module.pubsub.topic.id
    data = base64encode(jsonencode(merge(
      { monitoring_project = var.project_id },
      var.quota_config
    )))
  }
}
resource "google_project_iam_member" "metric_writer" {
  project = module.project.project_id
  role    = "roles/monitoring.metricWriter"
  member  = module.cf.service_account_iam_email
}

resource "google_project_iam_member" "network_viewer" {
  for_each = toset(local.projects)
  project  = each.key
@ -104,17 +101,16 @@ resource "google_project_iam_member" "quota_viewer" {
  member = module.cf.service_account_iam_email
}
resource "google_monitoring_alert_policy" "default" {
  for_each     = var.alert_configs
  project      = module.project.project_id
  display_name = "Monitor quota ${each.key}"
  combiner     = "OR"
  conditions {
    display_name = "Threshold ${each.value.threshold} for ${each.key}."
    condition_threshold {
      filter          = "metric.type=\"custom.googleapis.com/quota/${each.key}\" resource.type=\"global\""
      threshold_value = each.value.threshold
      comparison      = "COMPARISON_GT"
      duration        = "0s"
      aggregations {
@ -128,16 +124,17 @@ resource "google_monitoring_alert_policy" "alert_policy" {
      }
    }
  }
  enabled     = each.value.enabled
  user_labels = each.value.labels
  documentation {
    content = (
      each.value.documentation != null
      ? each.value.documentation
      : "Quota over threshold of ${each.value.threshold} for ${each.key}."
    )
  }
}

resource "random_pet" "random" {
  length = 1
}


@ -0,0 +1,233 @@
#! /usr/bin/env python3
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Sync GCE quota usage to Stackdriver for multiple projects.

This tool fetches global and/or regional quotas from the GCE API for
multiple projects, and sends them to Stackdriver as custom metrics, where they
can be used to set alert policies or create charts.
"""

import base64
import collections
import datetime
import itertools
import json
import logging
import warnings

import click
import google.auth

from google.auth.transport.requests import AuthorizedSession

BASE = 'custom.googleapis.com/quota'
HTTP = AuthorizedSession(google.auth.default()[0])
HTTP_HEADERS = {'content-type': 'application/json; charset=UTF-8'}
URL_PROJECT = 'https://compute.googleapis.com/compute/v1/projects/{}'
URL_REGION = 'https://compute.googleapis.com/compute/v1/projects/{}/regions/{}'
URL_TS = 'https://monitoring.googleapis.com/v3/projects/{}/timeSeries'

_Quota = collections.namedtuple('_Quota',
                                'project region tstamp metric limit usage')
HTTPRequest = collections.namedtuple(
    'HTTPRequest', 'url data headers', defaults=[{}, {
        'content-type': 'application/json; charset=UTF-8'
    }])


class Quota(_Quota):
  'Compute quota.'

  def _api_format(self, name, value):
    'Return a specific timeseries for this quota in API format.'
    d = {
        'metric': {
            'type': f'{BASE}/{self.metric.lower()}/{name}',
            'labels': {
                'location': self.region,
                'project': self.project
            }
        },
        'resource': {
            'type': 'global',
            'labels': {}
        },
        'metricKind': 'GAUGE',
        'points': [{
            'interval': {
                'endTime': f'{self.tstamp.isoformat("T")}Z'
            },
            'value': {}
        }]
    }
    if name == 'ratio':
      d['valueType'] = 'DOUBLE'
      d['points'][0]['value'] = {'doubleValue': value}
    else:
      d['valueType'] = 'INT64'
      d['points'][0]['value'] = {'int64Value': value}
    # remove this label if cardinality gets too high
    d['metric']['labels']['quota'] = f'{self.usage}/{self.limit}'
    return d

  @property
  def timeseries(self):
    'Yield the ratio and usage timeseries for this quota.'
    try:
      ratio = self.usage / float(self.limit)
    except ZeroDivisionError:
      ratio = 0
    yield self._api_format('ratio', ratio)
    yield self._api_format('usage', self.usage)
    # yield self._api_format('limit', self.limit)


def batched(iterable, n):
  'Batches data into lists of length n. The last batch may be shorter.'
  # batched('ABCDEFG', 3) --> ABC DEF G
  if n < 1:
    raise ValueError('n must be at least one')
  it = iter(iterable)
  while (batch := list(itertools.islice(it, n))):
    yield batch


def configure_logging(verbose=True):
  'Basic logging configuration.'
  level = logging.DEBUG if verbose else logging.INFO
  logging.basicConfig(level=level)
  warnings.filterwarnings('ignore', r'.*end user credentials.*', UserWarning)


def fetch(request, delete=False):
  'Minimal HTTP client interface for API calls.'
  logging.debug(f'fetch {"POST" if request.data else "GET"} {request.url}')
  logging.debug(request.data)
  try:
    if delete:
      response = HTTP.delete(request.url, headers=request.headers)
    elif not request.data:
      response = HTTP.get(request.url, headers=request.headers)
    else:
      response = HTTP.post(request.url, headers=request.headers,
                           data=json.dumps(request.data))
  except google.auth.exceptions.RefreshError as e:
    raise SystemExit(e.args[0])
  try:
    rdata = json.loads(response.content)
  except json.JSONDecodeError as e:
    logging.critical(e)
    raise SystemExit(f'Error decoding response: {response.content}')
  if response.status_code != 200:
    logging.critical(rdata)
    error = rdata.get('error', {})
    raise SystemExit('API error: {} (HTTP {})'.format(
        error.get('message', 'error message cannot be decoded'),
        error.get('code', 'no code found')))
  return rdata


def write_timeseries(project, data):
  'Sends timeseries to the API.'
  logging.debug(f'write {len(data["timeSeries"])} timeseries')
  request = HTTPRequest(URL_TS.format(project), data)
  return fetch(request)


def get_quotas(project, region='global'):
  'Fetch GCE per-project or per-region quotas from the API.'
  if region == 'global':
    request = HTTPRequest(URL_PROJECT.format(project))
  else:
    request = HTTPRequest(URL_REGION.format(project, region))
  resp = fetch(request)
  ts = datetime.datetime.utcnow()
  for quota in resp.get('quotas', []):
    yield Quota(project, region, ts, **quota)


@click.command()
@click.argument('project-id', required=True)
@click.option(
    '--project-ids', multiple=True, help=
    'Project ids to monitor (multiple). Defaults to monitoring project if not set.'
)
@click.option('--regions', multiple=True,
              help='Regions (multiple). Defaults to "global" if not set.')
@click.option('--include', multiple=True,
              help='Only include quotas starting with keyword (multiple).')
@click.option('--exclude', multiple=True,
              help='Exclude quotas starting with keyword (multiple).')
@click.option('--dry-run', is_flag=True, help='Do not write metrics.')
@click.option('--verbose', is_flag=True, help='Verbose output.')
def main_cli(project_id=None, project_ids=None, regions=None, include=None,
             exclude=None, dry_run=False, verbose=False):
  'Fetch GCE quotas and write them as custom metrics to Stackdriver.'
  try:
    _main(project_id, project_ids, regions, include, exclude, dry_run, verbose)
  except RuntimeError as e:
    logging.exception(f'exception raised: {e.args[0]}')


def main(event, context):
  """Cloud Function entry point."""
  try:
    data = json.loads(base64.b64decode(event['data']).decode('utf-8'))
    _main(**data)
  except RuntimeError:
    raise


def _main(monitoring_project, projects=None, regions=None, include=None,
          exclude=None, dry_run=False, verbose=False):
  """Module entry point used by cli and cloud function wrappers."""
  configure_logging(verbose=verbose)
  projects = projects or [monitoring_project]
  regions = regions or ['global']
  include = set(include or [])
  exclude = set(exclude or [])
  for k in ('monitoring_project', 'projects', 'regions', 'include', 'exclude'):
    logging.debug(f'{k} {locals().get(k)}')
  timeseries = []
  logging.info(f'get quotas ({len(projects)} projects {len(regions)} regions)')
  for project in projects:
    for region in regions:
      logging.info(f'get quota for {project} in {region}')
      for quota in get_quotas(project, region):
        metric = quota.metric.lower()
        if include and not any(metric.startswith(k) for k in include):
          logging.debug(f'skipping {project}:{region}:{metric} not included')
          continue
        if exclude and any(metric.startswith(k) for k in exclude):
          logging.debug(f'skipping {project}:{region}:{metric} excluded')
          continue
        logging.debug(f'quota {project}:{region}:{metric}')
        timeseries += list(quota.timeseries)
  logging.info(f'{len(timeseries)} timeseries')
  i, l = 0, len(timeseries)
  for batch in batched(timeseries, 30):
    data = list(batch)
    logging.info(f'sending {len(batch)} timeseries out of {l - i}/{l} left')
    i += len(batch)
    if not dry_run:
      write_timeseries(monitoring_project, {'timeSeries': list(data)})
    elif verbose:
      print(data)
  logging.info(f'{l} timeseries done (dry run {dry_run})')


if __name__ == '__main__':
  main_cli()


@ -0,0 +1,4 @@
click
functions-framework
google-api-core
google-cloud-monitoring


@ -0,0 +1,104 @@
#!/usr/bin/env python
# Copyright 2023 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
'Manages metric descriptors for the DR metrics.'

import collections
import json
import logging
import urllib.parse

import click
import google.auth

from google.auth.transport.requests import AuthorizedSession

Descriptor = collections.namedtuple('Descriptor', 'name labels is_bool r_type',
                                    defaults=[True, 'global'])
HTTPRequest = collections.namedtuple(
    'HTTPRequest', 'url data headers', defaults=[{}, {
        'content-type': 'application/json; charset=UTF-8'
    }])


class Error(Exception):
  pass


BASE = 'custom.googleapis.com/quota'
HTTP = AuthorizedSession(google.auth.default()[0])


def descriptors_get(project):
  base = urllib.parse.quote_plus(BASE)
  url = (f'https://content-monitoring.googleapis.com/v3/projects/{project}/'
         'metricDescriptors?filter=metric.type%20%3D%20starts_with'
         f'(%22{base}%22)')
  return HTTPRequest(url)


def descriptor_delete(project, type):
  url = (f'https://monitoring.googleapis.com/v3/projects/{project}/'
         f'metricDescriptors/{type}')
  return HTTPRequest(url)


def fetch(request, delete=False):
  'Minimal HTTP client interface for API calls.'
  logging.debug(f'fetch {"POST" if request.data else "GET"} {request.url}')
  try:
    if delete:
      response = HTTP.delete(request.url, headers=request.headers)
    elif not request.data:
      response = HTTP.get(request.url, headers=request.headers)
    else:
      response = HTTP.post(request.url, headers=request.headers,
                           data=json.dumps(request.data))
  except google.auth.exceptions.RefreshError as e:
    raise SystemExit(e.args[0])
  if response.status_code != 200:
    logging.critical(
        f'response code {response.status_code} for URL {request.url}')
    logging.critical(response.content)
    logging.debug(request.data)
    raise Error('API error')
  return json.loads(response.content)


@click.command()
@click.argument('project')
@click.option('--delete', default=False, is_flag=True,
              help='Delete descriptors.')
@click.option('--dry-run', default=False, is_flag=True,
              help='Show but do not perform actions.')
def main(project, delete=False, dry_run=False):
  'Program entry point.'
  logging.basicConfig(level=logging.INFO)
  logging.info(f'getting descriptors for "{BASE}"')
  response = fetch(descriptors_get(project))
  existing = [d['type'] for d in response.get('metricDescriptors', [])]
  logging.info(f'{len(existing)} descriptors')
  if delete:
    for name in existing:
      logging.info(f'deleting descriptor {name}')
      if not dry_run:
        try:
          fetch(descriptor_delete(project, name), delete=True)
        except Error:
          logging.critical(f'error deleting descriptor {name}')


if __name__ == '__main__':
  main()


@ -14,10 +14,20 @@
* limitations under the License.
*/
variable "alert_configs" {
  description = "Configure creation of monitoring alerts for specific quotas. Keys match quota names."
  type = map(object({
    documentation = optional(string)
    enabled       = optional(bool)
    labels        = optional(map(string))
    threshold     = optional(number, 0.75)
  }))
  nullable = false
  default  = {}
  validation {
    condition     = alltrue([for k, v in var.alert_configs : v != null])
    error_message = "Set values as {} instead of null."
  }
}
variable "bundle_path" {
@ -32,10 +42,13 @@ variable "name" {
  default = "quota-monitor"
}
variable "project_create_config" {
  description = "Create project instead of using an existing one."
  type = object({
    billing_account = string
    parent          = optional(string)
  })
  default = null
}
variable "project_id" {
@ -46,15 +59,18 @@ variable "project_id" {
variable "quota_config" {
  description = "Cloud function configuration."
  type = object({
    exclude = optional(list(string), [
      "a2", "c2", "c2d", "committed", "g2", "interconnect", "m1", "m2", "m3",
      "nvidia", "preemptible"
    ])
    include  = optional(list(string))
    projects = optional(list(string))
    regions  = optional(list(string))
    dry_run  = optional(bool, false)
    verbose  = optional(bool, false)
  })
  nullable = false
  default  = {}
}
variable "region" {