
Organization bootstrap

The primary purpose of this stage is to enable critical organization-level functionality that depends on broad administrative permissions, and prepare the prerequisites needed to enable automation in this and future stages.

It is intentionally simple, both to minimize the use of administrative-level permissions and to make auditing and troubleshooting easy, and it only deals with three sets of resources:

  • project, service accounts, and GCS buckets for automation
  • projects, BQ datasets, and sinks for audit log and billing exports
  • IAM bindings on the organization

Use the following diagram as a simple high-level reference for the sections below, which describe the stage and its possible customizations in detail.

Organization-level diagram

Design overview and choices

As mentioned above, this stage only does the bare minimum required to bootstrap automation, and to ensure that base audit and billing exports are in place from the start to provide some measure of accountability, even before the security configurations are applied in a later stage.

It also sets up organization-level IAM bindings so the Organization Administrator role is only used here, trading off some design freedom for ease of auditing and troubleshooting, and reducing the risk of costly security mistakes down the line. The only exception to this rule is for the Resource Management stage service account, described below.

User groups

User groups are important, not only here but throughout the whole automation process. They provide a stable frame of reference that decouples the final set of permissions for each group from the stage where entities and resources are created and their IAM bindings defined. For example, the final set of roles for the networking group is contributed by this stage at the organization level (XPN Admin, Cloud Asset Viewer, etc.), and by the Resource Management stage at the folder level.

We have standardized the initial set of groups on those outlined in the GCP Enterprise Setup Checklist to simplify adoption. They provide a comprehensive and flexible starting point that can suit most users. Adding new groups, or deviating from the initial setup, is possible and reasonably simple, as briefly outlined in the Customizations section below.

Organization-level IAM

The service account used in the Resource Management stage needs to be able to grant specific roles at the organizational level (roles/billing.user, roles/compute.xpnAdmin, etc.), to enable specific functionality for subsequent stages that deal with network or security resources, or billing-related activities.

In order to be able to assign those roles without having the full authority of the Organization Admin role, this stage defines a custom role that only allows setting IAM policies on the organization, and grants it via a delegated role grant that only allows it to be used to grant a limited subset of roles.

In this way, the Resource Management service account can effectively act as an Organization Admin, but only to grant the roles it actually needs to control.
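The delegated role grant described above is expressed as an IAM condition on the binding. The sketch below is a hypothetical helper, using an illustrative role list rather than the one this stage defines, that assembles such a condition with the modifiedGrantsByRole attribute Google Cloud documents for limiting which roles a principal can grant.

```shell
#!/usr/bin/env bash
# Illustrative list of roles the Resource Management service account may grant
# (an assumption for this example, not the list used by this stage).
DELEGATED_ROLES=(roles/billing.user roles/compute.xpnAdmin)

# Hypothetical helper: build an IAM condition expression that only allows
# adding or removing bindings for the roles passed as arguments.
delegated_grant_condition() {
  local quoted=() role
  for role in "$@"; do
    quoted+=("'$role'")
  done
  # join the quoted roles with commas
  local IFS=,
  printf "api.getAttribute('iam.googleapis.com/modifiedGrantsByRole', []).hasOnly([%s])\n" "${quoted[*]}"
}

delegated_grant_condition "${DELEGATED_ROLES[@]}"
```

A holder of the custom role bound with such a condition can call setIamPolicy on the organization, but only to change bindings for the listed roles.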

One consequence of the above setup is the need to configure IAM bindings as non-authoritative for the roles included in the IAM condition, since those same roles are effectively under the control of two stages: this one and Resource Management. Using authoritative bindings for these roles would generate potential conflicts, where each stage could overwrite and negate the bindings applied by the other on every apply.
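To make the conflict concrete, the toy model below (plain shell, not Terraform, with placeholder principals) contrasts two stages managing the same role authoritatively, where the last apply wins, with non-authoritative bindings, where members accumulate and a re-apply changes nothing.

```shell
#!/usr/bin/env bash
# Toy model, not Terraform: the members bound to a single organization role,
# kept as a space-separated string. Principals below are placeholders.
members=""

# Authoritative binding: the applying stage replaces the whole member list.
apply_authoritative() { members="$*"; }

# Non-authoritative binding: each member is added only if not already present.
apply_additive() {
  local m
  for m in "$@"; do
    case " $members " in
      *" $m "*) ;;                               # already bound, keep as is
      *) members="${members:+$members }$m" ;;    # append the new member
    esac
  done
}

# Two stages manage the same role authoritatively: the last apply wins.
apply_authoritative "group:gcp-network-admins@example.com"                     # this stage
apply_authoritative "serviceAccount:stage-sa@example.iam.gserviceaccount.com"  # Resource Management
echo "authoritative: $members"

# Same sequence with non-authoritative bindings, plus a re-apply of this stage.
members=""
apply_additive "group:gcp-network-admins@example.com"
apply_additive "serviceAccount:stage-sa@example.iam.gserviceaccount.com"
apply_additive "group:gcp-network-admins@example.com"
echo "additive: $members"
```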

Automation project and resources

One other design choice worth mentioning here is using a single automation project for all foundational stages. We trade off some complexity on the API side (a single source for usage quotas, multiple service activations) for increased flexibility and simpler operations, while still providing the same degree of separation via resource-level IAM.

Billing account

We support three use cases with regard to billing:

  • the billing account is part of this same organization, IAM bindings will be set at the organization level
  • the billing account is part of a different organization, billing IAM bindings will be set at the organization level in the billing account owning organization
  • the billing account is not considered part of an organization (even though it might be), billing IAM bindings are set on the billing account itself
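As a quick reference for the three cases, the hypothetical helper below maps the inputs to the place where billing IAM bindings end up. It only restates the rules above; the function name and output format are illustrative, not part of this stage.

```shell
#!/usr/bin/env bash
# Hypothetical helper: given this organization's id, the id of the organization
# owning the billing account (empty for a standalone billing account) and the
# billing account id, print the case and where billing IAM bindings are set.
billing_iam_target() {
  local org_id="$1" billing_org_id="$2" billing_account_id="$3"
  if [ -z "$billing_org_id" ]; then
    echo "standalone billingAccounts/$billing_account_id"
  elif [ "$billing_org_id" = "$org_id" ]; then
    echo "same-org organizations/$org_id"
  else
    echo "other-org organizations/$billing_org_id"
  fi
}

billing_iam_target 123456 123456 ABCD-01234-ABCD   # same organization
billing_iam_target 123456 789012 ABCD-01234-ABCD   # different organization
billing_iam_target 123456 ""     ABCD-01234-ABCD   # standalone billing account
```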

For same-organization billing, we configure a custom organization role that can set IAM bindings, and grant it via a delegated role grant that limits its scope to the relevant roles.

For details on configuring the different billing account modes, refer to the How to run this stage section below.

Naming

We are intentionally not supporting random prefixes or suffixes for names, as that is an antipattern typically only used in development. It does not map to our customers' actual production usage, where they always adopt a fixed naming convention.

What is implemented here is a fairly common convention, composed of tokens ordered by relative importance:

  • a static prefix (e.g. myco or myco-gcp)
  • an environment identifier (e.g. prod)
  • a team/owner identifier (e.g. sec for Security)
  • a context identifier (e.g. core or kms)
  • an arbitrary identifier used to distinguish similar resources (e.g. 0, 1)

Tokens are joined by a - character, making it easy to separate the individual tokens visually, and to programmatically split them in billing exports to derive initial high-level groupings for cost attribution.
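The sketch below shows the convention in shell: joining tokens into a name, and splitting a name back into tokens as one might when grouping billing export rows. Both helpers are hypothetical and the split assumes a single-token prefix; a prefix such as myco-gcp would shift every following token.

```shell
#!/usr/bin/env bash
# Join naming tokens with "-", skipping empty tokens.
make_name() {
  local tokens=() t
  for t in "$@"; do
    if [ -n "$t" ]; then tokens+=("$t"); fi
  done
  local IFS=-
  echo "${tokens[*]}"
}

# Split a full name back into its tokens (assumes a single-token prefix).
split_name() {
  local prefix env team context id
  IFS=- read -r prefix env team context id <<< "$1"
  echo "prefix=$prefix env=$env team=$team context=$context id=$id"
}

name="$(make_name myco prod sec kms 0)"
echo "$name"          # myco-prod-sec-kms-0
split_name "$name"    # prefix=myco env=prod team=sec context=kms id=0
```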

The convention is used in its full form only for specific resources with globally unique names (projects, GCS buckets). Other resources adopt a shorter version for legibility, as the full context can always be derived from their project.

The Customizations section on names below explains how to configure tokens, or implement a different naming convention.

How to run this stage

This stage has straightforward initial requirements, as it is designed to work on newly created GCP organizations. Four steps are needed to bring up this stage:

  • an Organization Admin self-assigns the required roles listed below
  • the same administrator runs the first init/apply sequence passing a special variable to apply
  • the providers configuration file is derived from the Terraform output or linked from the generated file
  • a second init is run to migrate state, and from then on, the stage is run via impersonation

Prerequisites

The Organization Admin running the first apply needs to self-grant the following roles:

  • Billing Account Administrator (roles/billing.admin) either on the organization or the billing account (see the following section for details)
  • Logging Admin (roles/logging.admin)
  • Organization Role Administrator (roles/iam.organizationRoleAdmin)
  • Organization Administrator (roles/resourcemanager.organizationAdmin)
  • Project Creator (roles/resourcemanager.projectCreator)

To quickly self-grant the above roles, run the following code snippet as the initial Organization Admin:

export BOOTSTRAP_ORG_ID=123456
export BOOTSTRAP_USER=$(gcloud config list --format 'value(core.account)')
BOOTSTRAP_ROLES=(roles/billing.admin roles/logging.admin roles/iam.organizationRoleAdmin roles/resourcemanager.projectCreator)
# expand the whole array: a bare $BOOTSTRAP_ROLES only yields the first role in bash
for role in "${BOOTSTRAP_ROLES[@]}"; do
  gcloud organizations add-iam-policy-binding "$BOOTSTRAP_ORG_ID" \
    --member "user:$BOOTSTRAP_USER" --role "$role"
done
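To review the grants before running them, a dry-run sketch like the following prints the commands instead of executing them. Values are placeholders and the function name is hypothetical; the loop iterates over "${BOOTSTRAP_ROLES[@]}", which expands every array element in both bash and zsh, whereas a bare $BOOTSTRAP_ROLES only expands to the first element in bash.

```shell
#!/usr/bin/env bash
# Placeholder values: replace them with your own.
BOOTSTRAP_ORG_ID=123456
BOOTSTRAP_USER=admin@example.com
BOOTSTRAP_ROLES=(roles/billing.admin roles/logging.admin roles/iam.organizationRoleAdmin roles/resourcemanager.projectCreator)

# Print, without running them, the gcloud commands for each role.
preview_grants() {
  local role
  for role in "${BOOTSTRAP_ROLES[@]}"; do
    echo "gcloud organizations add-iam-policy-binding $BOOTSTRAP_ORG_ID --member user:$BOOTSTRAP_USER --role $role"
  done
}

preview_grants
```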

Billing account in a different organization

If you are using a billing account belonging to a different organization (e.g. in multiple organization setups), some initial configurations are needed to ensure the identities running this stage can assign billing-related roles.

If the billing organization is managed by another version of this stage, we leverage the organizationIamAdmin role created there, to allow restricted granting of billing roles at the organization level.

If that's not the case, an equivalent role needs to exist, or the predefined resourcemanager.organizationAdmin role can be used if not managed authoritatively. The role name then needs to be manually changed in the billing.tf file, in the google_organization_iam_binding resource.

The identity applying this stage for the first time also needs two roles in the billing organization; they can be removed after the first apply completes successfully:

export BILLING_ORG_ID=789012
BILLING_ROLES=(roles/billing.admin roles/resourcemanager.organizationAdmin)
for role in "${BILLING_ROLES[@]}"; do
  gcloud organizations add-iam-policy-binding "$BILLING_ORG_ID" \
    --member "user:$BOOTSTRAP_USER" --role "$role"
done

Standalone billing account

If you are using a standalone billing account, the identity applying this stage for the first time needs to be a billing account administrator:

export BILLING_ACCOUNT_ID=ABCD-01234-ABCD
gcloud beta billing accounts add-iam-policy-binding "$BILLING_ACCOUNT_ID" \
  --member "user:$BOOTSTRAP_USER" --role roles/billing.admin

Groups

Before the first run, the following IAM groups must exist to allow IAM bindings to be created (actual names are flexible, see the Customizations section):

  • gcp-billing-admins
  • gcp-devops
  • gcp-network-admins
  • gcp-organization-admins
  • gcp-security-admins
  • gcp-support

Configure variables

Then make sure you have configured the correct values for the following variables by providing a terraform.tfvars file:

  • billing_account: an object containing the id of your billing account, derived from the Cloud Console UI or by running gcloud beta billing accounts list, and the id of the organization owning it (or null to use the billing account in isolation)
  • groups: the name mappings for your groups; if you're following the default convention, you can leave this to the provided default
  • organization.id, organization.domain, organization.customer_id: the id, domain and customer id of your organization, derived from the Cloud Console UI or by running gcloud organizations list
  • prefix: the fixed prefix used in your naming convention
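A minimal terraform.tfvars could be written as follows. All values are placeholders; the organization attributes (id, domain, customer_id) come from the list above, while organization_id as the attribute name inside billing_account is an assumption, so verify both objects against variables.tf.

```shell
#!/usr/bin/env bash
# Target file, overridable for testing.
TFVARS_FILE="${TFVARS_FILE:-terraform.tfvars}"

# Quoted heredoc delimiter: the content is written verbatim, no expansion.
cat > "$TFVARS_FILE" <<'EOF'
billing_account = {
  id              = "ABCD-01234-ABCD"
  organization_id = 123456 # or null to use the billing account in isolation
}
organization = {
  id          = 123456
  domain      = "example.com"
  customer_id = "C00000000"
}
prefix = "myco"
# groups can be omitted when following the default naming convention
EOF
```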

Output files and cross-stage variables

At any time during the life of this stage, you can configure it to automatically generate provider configurations and variable files for this and the following stages, to simplify exchanging inputs and outputs between stages and avoid having to edit files manually.

Automatic generation of files is disabled by default. To enable the mechanism, set the outputs_location variable to a valid path on a local filesystem, e.g.

outputs_location = "../../config"

Once the variable is set, apply will generate and manage providers and variables files, including the initial one used for this stage after the first run. You can then link these files in the relevant stages, instead of manually transferring outputs from one stage to Terraform variables in another.
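For example, a downstream stage can pick up its generated files by linking them into its own folder. The hypothetical helper below assumes outputs_location is ../../config relative to each stage folder, as in the example above, and is meant to be run from the directory that contains the stage folders.

```shell
#!/usr/bin/env bash
# Symlink the files generated for one stage into that stage's folder.
link_stage_outputs() {
  local stage="$1" src="../../config/$1"
  # $src is relative to the stage folder, where the links will live
  if [ ! -d "$stage/$src" ]; then
    echo "no generated files for $stage" >&2
    return 1
  fi
  (cd "$stage" && ln -sf "$src"/* ./)
}

# usage, from the stages directory: link_stage_outputs 01-resman
```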

Below is the outline of the output files generated by this stage:

[path specified in outputs_location]
├── 00-bootstrap
│   ├── providers.tf
├── 01-resman
│   ├── providers.tf
│   ├── terraform-bootstrap.auto.tfvars.json
├── 02-networking
│   ├── terraform-bootstrap.auto.tfvars.json
├── 02-networking-nva
│   ├── terraform-bootstrap.auto.tfvars.json
├── 02-security
│   ├── terraform-bootstrap.auto.tfvars.json
├── 03-gke-multitenant-dev
│   └── terraform-bootstrap.auto.tfvars.json
├── 03-gke-multitenant-prod
│   └── terraform-bootstrap.auto.tfvars.json
├── 03-project-factory-dev
│   └── terraform-bootstrap.auto.tfvars.json
├── 03-project-factory-prod
│   └── terraform-bootstrap.auto.tfvars.json

Running the stage

Before running init and apply, check your environment so no extra variables that might influence authentication are present (e.g. GOOGLE_IMPERSONATE_SERVICE_ACCOUNT). In general you should use user application credentials, and FAST will then take care to provision automation identities and configure impersonation for you.
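A quick way to spot such variables is a check like the one below. The list covers common variables read by Google's Terraform provider and client libraries but is not exhaustive, and the function name is hypothetical.

```shell
#!/usr/bin/env bash
# Warn about environment variables that can override how Terraform's Google
# provider authenticates. Returns 1 if any of them is set.
check_auth_env() {
  local found=0 v
  for v in GOOGLE_IMPERSONATE_SERVICE_ACCOUNT GOOGLE_APPLICATION_CREDENTIALS GOOGLE_CREDENTIALS GOOGLE_OAUTH_ACCESS_TOKEN; do
    # ${!v} is bash indirect expansion: the value of the variable named by $v
    if [ -n "${!v:-}" ]; then
      echo "warning: $v is set" >&2
      found=1
    fi
  done
  return "$found"
}

check_auth_env && echo "no authentication overrides found"
```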

When running the first apply as a user, you need to pass a special runtime variable so that the user roles are preserved when setting IAM bindings.

terraform init
terraform apply \
  -var bootstrap_user=$(gcloud config list --format 'value(core.account)')

Once the initial apply completes successfully, configure a remote backend using the new GCS bucket, and impersonation on the automation service account for this stage. To do this, you can use the generated providers.tf file if you have configured output files as described above, or extract its contents from Terraform's output, then migrate state with terraform init:

# if using output files, with outputs_location set to ../../config
ln -s ../../config/00-bootstrap/* ./
# or from outputs if not using output files
terraform output -json providers | jq -r '.["00-bootstrap"]' \
  > providers.tf
# migrate state to GCS bucket configured in providers file
terraform init -migrate-state
# apply again without bootstrap_user to remove the user IAM bindings
terraform apply
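Since the state migration relies on the backend configuration in providers.tf, a simple guard can catch a missing or incomplete file before running terraform init -migrate-state. This hypothetical check assumes the generated file declares a standard backend "gcs" block, and it only greps the file rather than validating the configuration.

```shell
#!/usr/bin/env bash
# Return 1 if the providers file is missing or has no gcs backend block.
check_providers_file() {
  local file="${1:-providers.tf}"
  if [ ! -f "$file" ]; then
    echo "missing $file" >&2
    return 1
  fi
  if ! grep -q 'backend "gcs"' "$file"; then
    echo "$file has no gcs backend block" >&2
    return 1
  fi
}

# usage: check_providers_file && terraform init -migrate-state
```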

Customizations

Most variables (e.g. billing_account and organization) are only used to input actual values and should be self-explanatory. The only meaningful customizations that apply here are groups, and IAM roles.

Group names

As we mentioned above, groups reflect the convention used in the GCP Enterprise Setup Checklist, with an added level of indirection: the groups variable maps logical names to actual names, so that you don't need to delve into the code if your group names do not comply with the checklist convention.

For example, if your network admins team is called net-rockstars@example.com, simply set that name in the variable, minus the domain which is interpolated internally with the organization domain:

variable "groups" {
  description = "Group names to grant organization-level permissions."
  type        = map(string)
  default = {
    gcp-network-admins      = "net-rockstars"
    # [...]
  }
}
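Instead of editing the default in variables.tf, the same override can go in a tfvars file. Terraform replaces the whole map when a variable is set, so every logical name must be listed, not only the changed one; the keys below assume the default map uses the six group names listed in the Groups section.

```shell
#!/usr/bin/env bash
# Write a groups override; only gcp-network-admins differs from the default.
cat > groups.auto.tfvars <<'EOF'
groups = {
  gcp-billing-admins      = "gcp-billing-admins"
  gcp-devops              = "gcp-devops"
  gcp-network-admins      = "net-rockstars"
  gcp-organization-admins = "gcp-organization-admins"
  gcp-security-admins     = "gcp-security-admins"
  gcp-support             = "gcp-support"
}
EOF
```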

If your groups layout differs substantially from the checklist, define all relevant groups in the groups variable, then rearrange IAM roles in the code to match your setup.

IAM

One other area where we directly support customizations is IAM. The code here, as in all stages, follows a simple pattern derived from best practices:

  • operational roles for humans are assigned to groups
  • any other principal is a service account

In code, this distinction is reflected in how IAM bindings are specified in the underlying module variables:

  • group roles "for humans" always use iam_groups variables
  • service account roles always use iam variables

This makes it easy to tweak user roles by adding mappings to the iam_groups variables of the relevant resources, without having to understand and deal with the details of service account roles.

In those cases where roles need to be assigned to end-user service accounts (e.g. an application or pipeline service account), we offer a stage-level iam variable that allows pinpointing individual role/members pairs, without having to touch the code internals, to avoid the risk of breaking a critical role for a robot account. The variable can also be used to assign roles to specific users or to groups external to the organization, e.g. to support external suppliers.

The one exception to this convention is for roles which are part of the delegated grant condition described above, and which can then be assigned from other stages. In this case, use the iam_additive variable, as it is implemented with non-authoritative resources. Using non-authoritative bindings ensures that re-executing this stage will not override any bindings set in downstream stages.
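As an illustration, both variables could be set from a tfvars file like the one below. Roles and principals are hypothetical examples in the role => [principal] format described in the Variables table; roles/compute.xpnAdmin stands in for a role assumed to be part of the delegated grant.

```shell
#!/usr/bin/env bash
# Write example IAM settings; all principals are placeholders.
cat > iam.auto.tfvars <<'EOF'
# authoritative bindings, e.g. for an external supplier
iam = {
  "roles/browser" = [
    "user:consultant@supplier.example.com",
  ]
}
# non-authoritative bindings for roles that other stages can also grant
iam_additive = {
  "roles/compute.xpnAdmin" = [
    "serviceAccount:net-pipeline@myco-prod-net-0.iam.gserviceaccount.com",
  ]
}
EOF
```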

Names and naming convention

Configuring the individual tokens for the naming convention described above has varying degrees of complexity:

  • the static prefix can be set via the prefix variable once
  • the environment identifier is set to prod, since resources here affect production and are treated as such; it can be changed in the main.tf locals

All other tokens are set directly in resource names, as providing abstractions to manage them would have added too much complexity to the code, making it less readable and more fragile.

If a different convention is needed, identify names via search/grep (e.g. with ^\s+name\s+=\s+") and change them in an editor: it should take a couple of minutes at most, as there's just a handful of modules and resources to change.
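The search can be scripted with a small helper like the one below (the function name is hypothetical). It uses POSIX character classes instead of \s, which not every grep -E implementation accepts, and -H so the file name is always printed.

```shell
#!/usr/bin/env bash
# List indented name = "..." attributes, with file name and line number.
# Defaults to all .tf files in the current directory.
find_names() {
  if [ "$#" -eq 0 ]; then set -- *.tf; fi
  grep -HnE '^[[:space:]]+name[[:space:]]+=[[:space:]]+"' "$@"
}

# usage, from the stage folder: find_names
```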

Names used in internal references (e.g. module.foo-prod.id) are only used by Terraform and do not influence resource naming, so they are best left untouched to avoid having to debug complex errors.

Files

| name | description | modules | resources |
|---|---|---|---|
| automation.tf | Automation project and resources. | gcs · iam-service-account · project | |
| billing.tf | Billing export project and dataset. | bigquery-dataset · organization · project | google_billing_account_iam_member · google_organization_iam_binding |
| log-export.tf | Audit log project and sink. | bigquery-dataset · gcs · logging-bucket · project · pubsub | |
| main.tf | Module-level locals and resources. | | |
| organization.tf | Organization-level IAM. | organization | google_organization_iam_binding |
| outputs.tf | Module outputs. | | local_file |
| variables.tf | Module variables. | | |

Variables

| name | description | type | required | default | producer |
|---|---|---|---|---|---|
| billing_account | Billing account id and organization id ('nnnnnnnn' or null). | object({…}) | ✓ | | |
| organization | Organization details. | object({…}) | ✓ | | |
| prefix | Prefix used for resources that need unique names. | string | ✓ | | |
| bootstrap_user | Email of the nominal user running this stage for the first time. | string | | null | |
| groups | Group names to grant organization-level permissions. | map(string) | | {…} | |
| iam | Organization-level custom IAM settings in role => [principal] format. | map(list(string)) | | {} | |
| iam_additive | Organization-level custom IAM settings in role => [principal] format for non-authoritative bindings. | map(list(string)) | | {} | |
| log_sinks | Org-level log sinks, in name => {type, filter} format. | map(object({…})) | | {…} | |
| outputs_location | Path where providers and tfvars files for the following stages are written. Leave empty to disable. | string | | null | |

Outputs

| name | description | sensitive | consumers |
|---|---|---|---|
| billing_dataset | BigQuery dataset prepared for billing export. | | |
| project_ids | Projects created by this stage. | | |
| providers | Terraform provider files for this stage and dependent stages. | | stage-01 |
| tfvars | Terraform variable files for the following stages. | | |