# Networking Dashboard

This repository provides an end-to-end solution that gathers some GCP networking quotas and limits (which cannot be seen in the GCP console today) and displays them in a dashboard.
The goal is to give better visibility into these limits, which facilitates capacity planning and helps you avoid hitting them.

Here is an example of the dashboard you can get with this solution:

<img src="metric.png" width="640px">

Here you see utilization (usage compared to the limit) for a specific metric (number of instances per VPC) for multiple VPCs and projects.

Three metric descriptors are created for each monitored resource: usage, limit and utilization. You can follow each of these and create alerting policies if a threshold is reached.
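
As a minimal sketch of how the three values relate, assuming utilization is simply usage divided by the limit: the function names, the 0.8 threshold, and the sample numbers below are illustrative and are not taken from the Cloud Function code.

```python
def utilization(usage: float, limit: float) -> float:
    """Return usage as a fraction of the limit (0.0 when the limit is zero or unset)."""
    if limit <= 0:
        return 0.0
    return usage / limit


def exceeds_threshold(usage: float, limit: float, threshold: float = 0.8) -> bool:
    """Return True when utilization is at or above the alerting threshold."""
    return utilization(usage, limit) >= threshold


# Hypothetical sample: 13,500 instances in a VPC whose limit is 15,000.
print(f"utilization: {utilization(13_500, 15_000):.0%}")
print("alert:", exceeds_threshold(13_500, 15_000))
```

An alerting policy on the utilization metric plays the role of `exceeds_threshold` here, with the threshold chosen to fit your capacity planning.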

## Usage

Clone this repository, then go through the following steps to create resources:
- Create a terraform.tfvars file with the following content:
  ```tfvars
  organization_id = "<YOUR-ORG-ID>"
  billing_account = "<YOUR-BILLING-ACCOUNT>"
  monitoring_project_id = "<YOUR-MONITORING-PROJECT>" # Monitoring project where the dashboard will be created and the solution deployed, a project named "mon-network-dashboard" will be created if left blank
  monitored_projects_list = ["project-1", "project2"] # Projects to be monitored by the solution
  monitored_folders_list = ["folder_id"] # Folders to be monitored by the solution
  prefix = "<YOUR-PREFIX>" # Monitoring project name prefix, monitoring project name is <YOUR-PREFIX>-network-dashboard, ignored if monitoring_project_id variable is provided
  cf_version = "V1" # Cloud Functions generation, "V1" (default) or "V2"; use "V2" if the function times out after 9 minutes
  ```
- `terraform init`
- `terraform apply`

Note: Organization-level viewing permissions are required for some metrics, such as firewall policies.

Once the resources are deployed, go to the following page to see the dashboard: https://console.cloud.google.com/monitoring/dashboards?project=<YOUR-MONITORING-PROJECT> (or <YOUR-METRICS-PROJECT> if `metrics_project_id` is populated).
A dashboard called "quotas-utilization" should have been created.

The Cloud Function runs every 10 minutes by default, so you should start getting data points after a few minutes.
You can use the metric explorer to view the data points for the different custom metrics created: https://console.cloud.google.com/monitoring/metrics-explorer?project=<YOUR-MONITORING-PROJECT> (or <YOUR-METRICS-PROJECT> if populated).
To change how often the Cloud Function runs, modify the "schedule_cron" variable in variables.tf (or override it in terraform.tfvars).
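
To illustrate what a schedule such as the default `*/10 * * * *` means, here is a small sketch that expands the minute field of a cron expression into the minutes of each hour at which the job fires. The helper `minutes_matched` is hypothetical and only handles the `*` and `*/N` forms; it is not a full cron parser and is not part of this blueprint.

```python
from typing import List


def minutes_matched(minute_field: str) -> List[int]:
    """Expand a cron minute field of the form '*' or '*/N' into matching minutes."""
    if minute_field == "*":
        return list(range(60))
    if minute_field.startswith("*/"):
        step = int(minute_field[2:])
        if step <= 0:
            raise ValueError("step must be positive")
        return list(range(0, 60, step))
    raise ValueError(f"unsupported minute field: {minute_field!r}")


# Default value of schedule_cron from the variables table.
schedule_cron = "*/10 * * * *"
print(minutes_matched(schedule_cron.split()[0]))
```

With the default schedule the function runs six times per hour; a value such as `*/30 * * * *` would reduce that to two runs per hour.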

Note that some charts in the dashboard align values over 1 hour, so you might need to wait up to an hour before those charts show data.

Once done testing, you can clean up resources by running `terraform destroy`.

## Supported limits and quotas
The Cloud Function currently tracks usage, limit and utilization of:
- active VPC peerings per VPC
- VPC peerings per VPC
- instances per VPC
- instances per VPC peering group
- Subnet IP ranges per VPC peering group
- internal forwarding rules for internal L4 load balancers per VPC
- internal forwarding rules for internal L7 load balancers per VPC
- internal forwarding rules for internal L4 load balancers per VPC peering group
- internal forwarding rules for internal L7 load balancers per VPC peering group
- Dynamic routes per VPC 
- Dynamic routes per VPC peering group 
- Static routes per project (VPC drill down is available for usage)
- Static routes per VPC peering group 
- IP utilization per subnet (% of IP addresses used in a subnet)
- VPC firewall rules per project (VPC drill down is available for usage)
- Tuples per Firewall Policy

It writes these values to custom metrics in Cloud Monitoring and creates a dashboard there to visualize the current utilization of these metrics.
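
For the subnet IP utilization metric listed above, the percentage can be sketched as follows. This is an illustration under stated assumptions, not the Cloud Function's actual code: the function name is hypothetical, and whether the function subtracts the four addresses that Google Cloud reserves in each primary IPv4 subnet range is an assumption made here.

```python
import ipaddress


def subnet_ip_utilization(cidr: str, used_addresses: int, reserved: int = 4) -> float:
    """Return the percentage of usable addresses in `cidr` that are in use."""
    network = ipaddress.ip_network(cidr)
    # Assumption: reserved addresses are excluded from the usable pool.
    usable = network.num_addresses - reserved
    if usable <= 0:
        return 0.0
    return 100.0 * used_addresses / usable


# Hypothetical sample: 63 addresses in use in a /24 subnet (252 usable addresses).
print(f"{subnet_ip_utilization('10.0.0.0/24', 63):.1f}%")
```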

Note that metrics are created in the cloud-function/metrics.yaml file. You can also edit default limits for a specific network in that file. See the example for `vpc_peering_per_network`.
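
The idea of a default limit with per-network overrides can be sketched like this. The dictionary layout, the `default_value` key, and the network path are assumptions made for illustration; check cloud-function/metrics.yaml for the real structure before editing it.

```python
# Hypothetical limits table: one default plus overrides keyed by network.
LIMITS = {
    "vpc_peering_per_network": {
        "default_value": 25,
        "projects/my-host-project/global/networks/vpc-prod": 40,
    }
}


def limit_for(metric: str, network: str, limits: dict = LIMITS) -> int:
    """Return the override for `network` if one exists, else the metric's default."""
    values = limits[metric]
    return values.get(network, values["default_value"])


print(limit_for("vpc_peering_per_network", "projects/my-host-project/global/networks/vpc-prod"))
print(limit_for("vpc_peering_per_network", "projects/my-host-project/global/networks/vpc-dev"))
```

Networks without an explicit entry fall back to the default, so only the networks whose limits differ from the default need to be listed.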

## Assumptions and limitations
- The Cloud Function (CF) assumes that all VPCs in peering groups are within the same organization, except for PSA peerings
- The CF only fetches subnet utilization data from the PSA peerings (not the VM, ILB or route usage)
- The CF assumes global routing is ON; this affects the dynamic routes usage calculation
- The CF assumes custom routes importing/exporting is ON; this affects the static and dynamic routes usage calculation
- The CF assumes all networks in peering groups have the same global routing and custom routes sharing configuration
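
To see why these assumptions matter, consider a simplified sketch of a peering group calculation, where the usage for a network's peering group is its own usage plus the usage of every directly peered network. The function and the sample data are hypothetical and do not reproduce the Cloud Function's implementation.

```python
from typing import Dict, List


def peering_group_usage(
    network: str,
    usage_per_network: Dict[str, int],
    peerings: Dict[str, List[str]],
) -> int:
    """Sum usage for `network` and its directly peered networks."""
    peers = peerings.get(network, [])
    # Networks with no known usage (for example, outside the monitored scope) count as 0.
    return usage_per_network.get(network, 0) + sum(
        usage_per_network.get(peer, 0) for peer in peers
    )


# Hypothetical sample: vpc-a is peered with vpc-b and vpc-c.
usage = {"vpc-a": 100, "vpc-b": 50, "vpc-c": 25}
peerings = {"vpc-a": ["vpc-b", "vpc-c"], "vpc-b": ["vpc-a"], "vpc-c": ["vpc-a"]}
print(peering_group_usage("vpc-a", usage, peerings))
```

In this sketch a peered network whose usage cannot be read contributes nothing to the total, which is why peerings that reach outside the monitored organization would lead to undercounted usage.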

## Next steps and ideas
In a future release, we could support:
- Google-managed VPCs that are peered with PSA (such as Cloud SQL or Memorystore)
- Dynamic routes calculation for VPCs/PPGs with "global routing" set to OFF
- Static routes calculation for projects/PPGs with "custom routes importing/exporting" set to OFF
- Calculations for cross-organization peering groups
- Support for different scopes (reduced and fine-grained)

If you are interested in this and/or would like to contribute, please contact legranda@google.com.
<!-- BEGIN TFDOC -->

## Variables

| name | description | type | required | default |
|---|---|:---:|:---:|:---:|
| [billing_account](variables.tf#L17) | The ID of the billing account to associate this project with | <code></code> | ✓ |  |
| [monitored_projects_list](variables.tf#L36) | ID of the projects to be monitored (where limits and quotas data will be pulled) | <code>list&#40;string&#41;</code> | ✓ |  |
| [organization_id](variables.tf#L54) | The organization id for the associated services | <code></code> | ✓ |  |
| [prefix](variables.tf#L58) | Customer name to use as prefix for monitoring project | <code></code> | ✓ |  |
| [cf_version](variables.tf#L21) | Cloud Function version, 2nd Gen or 1st Gen. Possible options: 'V1' or 'V2'. Use CFv2 if your Cloud Function times out after 9 minutes. CFv1 is used by default. | <code></code> |  | <code>V1</code> |
| [metrics_project_id](variables.tf#L46) | Optional, populate to write metrics and deploy the dashboard in a separated project | <code></code> |  |  |
| [monitored_folders_list](variables.tf#L30) | ID of the folders to be monitored (where limits and quotas data will be pulled) | <code>list&#40;string&#41;</code> |  | <code>&#91;&#93;</code> |
| [monitoring_project_id](variables.tf#L41) | Monitoring project where the dashboard will be created and the solution deployed; a project will be created if set to empty string, if metrics_project_id is provided, metrics and dashboard will be deployed there  | <code></code> |  |  |
| [project_monitoring_services](variables.tf#L63) | Service APIs enabled in the monitoring project if it will be created. | <code></code> |  | <code title="&#91;&#10;  &#34;artifactregistry.googleapis.com&#34;,&#10;  &#34;cloudasset.googleapis.com&#34;,&#10;  &#34;cloudbilling.googleapis.com&#34;,&#10;  &#34;cloudbuild.googleapis.com&#34;,&#10;  &#34;cloudfunctions.googleapis.com&#34;,&#10;  &#34;cloudresourcemanager.googleapis.com&#34;,&#10;  &#34;cloudscheduler.googleapis.com&#34;,&#10;  &#34;compute.googleapis.com&#34;,&#10;  &#34;iam.googleapis.com&#34;,&#10;  &#34;iamcredentials.googleapis.com&#34;,&#10;  &#34;logging.googleapis.com&#34;,&#10;  &#34;monitoring.googleapis.com&#34;,&#10;  &#34;pubsub.googleapis.com&#34;,&#10;  &#34;run.googleapis.com&#34;,&#10;  &#34;servicenetworking.googleapis.com&#34;,&#10;  &#34;serviceusage.googleapis.com&#34;,&#10;  &#34;storage-component.googleapis.com&#34;&#10;&#93;">&#91;&#8230;&#93;</code> |
| [region](variables.tf#L88) | Region used to deploy the cloud functions and scheduler | <code></code> |  | <code>europe-west1</code> |
| [schedule_cron](variables.tf#L93) | Cron format schedule to run the Cloud Function. Default is every 10 minutes. | <code></code> |  | <code>&#42;&#47;10 &#42; &#42; &#42; &#42;</code> |
| [vpc_connector_name](variables.tf#L99) | Serverless VPC connection name for the Cloud Function | <code></code> |  |  |

<!-- END TFDOC -->