Replace existing data platform

This commit is contained in:
Lorenzo Caggioni 2022-02-05 08:51:11 +01:00
parent 2cdea57954
commit b65d153ec1
53 changed files with 237 additions and 1523 deletions

View File

@ -5,7 +5,7 @@ This section contains **[foundational examples](./foundations/)** that bootstrap
Currently available examples:
- **cloud operations** - [Resource tracking and remediation via Cloud Asset feeds](./cloud-operations/asset-inventory-feed-remediation), [Granular Cloud DNS IAM via Service Directory](./cloud-operations/dns-fine-grained-iam), [Granular Cloud DNS IAM for Shared VPC](./cloud-operations/dns-shared-vpc), [Compute Engine quota monitoring](./cloud-operations/quota-monitoring), [Scheduled Cloud Asset Inventory Export to Bigquery](./cloud-operations/scheduled-asset-inventory-export-bq), [Packer image builder](./cloud-operations/packer-image-builder), [On-prem SA key management](./cloud-operations/onprem-sa-key-management)
- **data solutions** - [GCE/GCS CMEK via centralized Cloud KMS](./data-solutions/cmek-via-centralized-kms/), [Cloud Storage to Bigquery with Cloud Dataflow](./data-solutions/gcs-to-bq-with-dataflow/)
- **data solutions** - [GCE/GCS CMEK via centralized Cloud KMS](./data-solutions/cmek-via-centralized-kms/), [Cloud Storage to Bigquery with Cloud Dataflow](./data-solutions/gcs-to-bq-with-dataflow/), [Data Platform Foundations](./data-solutions/data-platform-foundations/)
- **factories** - [The why and the how of resource factories](./factories/README.md)
- **foundations** - [single level hierarchy](./foundations/environments/) (environments), [multiple level hierarchy](./foundations/business-units/) (business units + environments)
- **networking** - [hub and spoke via peering](./networking/hub-and-spoke-peering/), [hub and spoke via VPN](./networking/hub-and-spoke-vpn/), [DNS and Google Private Access for on-premises](./networking/onprem-google-access-dns/), [Shared VPC with GKE support](./networking/shared-vpc-gke/), [ILB as next hop](./networking/ilb-next-hop), [PSC for on-premises Cloud Function invocation](./networking/private-cloud-function-from-onprem/), [decentralized firewall](./networking/decentralized-firewall)

View File

@ -18,7 +18,7 @@ All resources use CMEK hosted in Cloud KMS running in a centralized project. The
### Data Platform Foundations
<a href="./data-platform-foundations/" title="Data Platform Foundations"><img src="./data-platform-foundations/02-resources/diagram.png" align="left" width="280px"></a>
<a href="./data-platform-foundations/" title="Data Platform Foundations"><img src="./data-platform-foundations/images/overview_diagram.png" align="left" width="280px"></a>
This [example](./data-platform-foundations/) implements a robust and flexible Data Foundation on GCP that provides opinionated defaults, allowing customers to build and scale out additional data pipelines quickly and reliably.
<br clear="left">

View File

@ -1,72 +0,0 @@
# Data Platform Foundations - Environment (Step 1)
This is the first step needed to deploy Data Platform Foundations, which creates projects and service accounts. Please refer to the [top-level Data Platform README](../README.md) for prerequisites.
The projects that will be created are:
- Common services
- Landing
- Orchestration & Transformation
- DWH
- Datamart
A main service account (named `data-platform-main` by default) will be created under the common services project and granted owner permissions on all the projects in scope.
This is a high level diagram of the created resources:
![Environment - Phase 1](./diagram.png "High-level Environment diagram")
## Running the example
To create the infrastructure:
- specify your variables in a `terraform.tfvars` file:
```tfm
billing_account = "1234-1234-1234"
parent = "folders/12345678"
admins = ["user:xxxxx@yyyyy.com"]
```
- make sure you have the right authentication setup (application default credentials, or a service account key) with the right permissions
- **The output of this stage contains the values for the resources stage**
- the `admins` variable contains a list of principals allowed to impersonate the service accounts; these principals will be granted the `roles/iam.serviceAccountTokenCreator` role
- run `terraform init` and `terraform apply`
Once done testing, you can clean up resources by running `terraform destroy`.
### CMEK configuration
You can configure GCP resources to use existing CMEK keys by setting the `service_encryption_key_ids` variable. You need to specify both a `global` and a `multiregional` key.
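As a reference, a minimal `terraform.tfvars` entry might look like the following sketch (project, keyring, and key names are illustrative):

```hcl
# Illustrative values: replace with keys from your existing Cloud KMS project.
service_encryption_key_ids = {
  global        = "projects/my-kms-project/locations/global/keyRings/my-keyring/cryptoKeys/my-global-key"
  multiregional = "projects/my-kms-project/locations/europe/keyRings/my-keyring/cryptoKeys/my-eu-key"
}
```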
### VPC-SC configuration
You can assign projects to an existing VPC-SC standard perimeter by setting the `service_perimeter_standard` variable. You can retrieve the list of existing perimeters from the GCP console or with the following command:
```
gcloud access-context-manager perimeters list --format="json" | grep name
```
The script uses the `google_access_context_manager_service_perimeter_resource` Terraform resource. If this resource is used alongside the `vpc-sc` module, remember to uncomment the `lifecycle` block in the `vpc-sc` module so they don't fight over which resources should be in the perimeter.
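For example, in `terraform.tfvars` (replace the placeholders with your access policy and perimeter names):

```hcl
# Format: accessPolicies/ACCESS_POLICY_NAME/servicePerimeters/PERIMETER_NAME
service_perimeter_standard = "accessPolicies/123456789/servicePerimeters/my_perimeter"
```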
<!-- BEGIN TFDOC -->
## Variables
| name | description | type | required | default |
|---|---|:---:|:---:|:---:|
| [billing_account_id](variables.tf#L21) | Billing account id. | <code>string</code> | ✓ | |
| [root_node](variables.tf#L50) | Parent folder or organization in 'folders/folder_id' or 'organizations/org_id' format. | <code>string</code> | ✓ | |
| [admins](variables.tf#L15) | List of users allowed to impersonate the service account. | <code>list&#40;string&#41;</code> | | <code>null</code> |
| [prefix](variables.tf#L26) | Prefix used to generate project id and name. | <code>string</code> | | <code>null</code> |
| [project_names](variables.tf#L32) | Override this variable if you need non-standard names. | <code title="object&#40;&#123;&#10; datamart &#61; string&#10; dwh &#61; string&#10; landing &#61; string&#10; services &#61; string&#10; transformation &#61; string&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | | <code title="&#123;&#10; datamart &#61; &#34;datamart&#34;&#10; dwh &#61; &#34;datawh&#34;&#10; landing &#61; &#34;landing&#34;&#10; services &#61; &#34;services&#34;&#10; transformation &#61; &#34;transformation&#34;&#10;&#125;">&#123;&#8230;&#125;</code> |
| [service_account_names](variables.tf#L55) | Override this variable if you need non-standard names. | <code title="object&#40;&#123;&#10; main &#61; string&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | | <code title="&#123;&#10; main &#61; &#34;data-platform-main&#34;&#10;&#125;">&#123;&#8230;&#125;</code> |
| [service_encryption_key_ids](variables.tf#L65) | Cloud KMS encryption key in {LOCATION => [KEY_URL]} format. Keys belong to existing project. | <code title="object&#40;&#123;&#10; multiregional &#61; string&#10; global &#61; string&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | | <code title="&#123;&#10; multiregional &#61; null&#10; global &#61; null&#10;&#125;">&#123;&#8230;&#125;</code> |
| [service_perimeter_standard](variables.tf#L78) | VPC Service control standard perimeter name in the form of 'accessPolicies/ACCESS_POLICY_NAME/servicePerimeters/PERIMETER_NAME'. All projects will be added to the perimeter in enforced mode. | <code>string</code> | | <code>null</code> |
## Outputs
| name | description | sensitive |
|---|---|:---:|
| [project_ids](outputs.tf#L17) | Project ids for created projects. | |
| [service_account](outputs.tf#L28) | Main service account. | |
| [service_encryption_key_ids](outputs.tf#L33) | Cloud KMS encryption keys in {LOCATION => [KEY_URL]} format. | |
<!-- END TFDOC -->

Binary file not shown.

Before

Width:  |  Height:  |  Size: 275 KiB

View File

@ -1,162 +0,0 @@
/**
* Copyright 2020 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
###############################################################################
# projects #
###############################################################################
module "project-datamart" {
source = "../../../../modules/project"
parent = var.root_node
billing_account = var.billing_account_id
prefix = var.prefix
name = var.project_names.datamart
services = [
"bigquery.googleapis.com",
"bigquerystorage.googleapis.com",
"bigqueryreservation.googleapis.com",
"storage.googleapis.com",
"storage-component.googleapis.com",
]
iam_additive = {
"roles/owner" = [module.sa-services-main.iam_email]
}
service_encryption_key_ids = {
bq = [var.service_encryption_key_ids.multiregional]
storage = [var.service_encryption_key_ids.multiregional]
}
# If used, remember to uncomment 'lifecycle' block in the
# modules/vpc-sc/google_access_context_manager_service_perimeter resource.
service_perimeter_standard = var.service_perimeter_standard
}
module "project-dwh" {
source = "../../../../modules/project"
parent = var.root_node
billing_account = var.billing_account_id
prefix = var.prefix
name = var.project_names.dwh
services = [
"bigquery.googleapis.com",
"bigquerystorage.googleapis.com",
"bigqueryreservation.googleapis.com",
"storage.googleapis.com",
"storage-component.googleapis.com",
]
iam_additive = {
"roles/owner" = [module.sa-services-main.iam_email]
}
service_encryption_key_ids = {
bq = [var.service_encryption_key_ids.multiregional]
storage = [var.service_encryption_key_ids.multiregional]
}
# If used, remember to uncomment 'lifecycle' block in the
# modules/vpc-sc/google_access_context_manager_service_perimeter resource.
service_perimeter_standard = var.service_perimeter_standard
}
module "project-landing" {
source = "../../../../modules/project"
parent = var.root_node
billing_account = var.billing_account_id
prefix = var.prefix
name = var.project_names.landing
services = [
"pubsub.googleapis.com",
"storage.googleapis.com",
"storage-component.googleapis.com",
]
iam_additive = {
"roles/owner" = [module.sa-services-main.iam_email]
}
service_encryption_key_ids = {
pubsub = [var.service_encryption_key_ids.global]
storage = [var.service_encryption_key_ids.multiregional]
}
# If used, remember to uncomment 'lifecycle' block in the
# modules/vpc-sc/google_access_context_manager_service_perimeter resource.
service_perimeter_standard = var.service_perimeter_standard
}
module "project-services" {
source = "../../../../modules/project"
parent = var.root_node
billing_account = var.billing_account_id
prefix = var.prefix
name = var.project_names.services
services = [
"bigquery.googleapis.com",
"cloudresourcemanager.googleapis.com",
"iam.googleapis.com",
"pubsub.googleapis.com",
"storage.googleapis.com",
"storage-component.googleapis.com",
"sourcerepo.googleapis.com",
"stackdriver.googleapis.com",
"cloudasset.googleapis.com",
"cloudkms.googleapis.com"
]
iam_additive = {
"roles/owner" = [module.sa-services-main.iam_email]
}
service_encryption_key_ids = {
storage = [var.service_encryption_key_ids.multiregional]
}
# If used, remember to uncomment 'lifecycle' block in the
# modules/vpc-sc/google_access_context_manager_service_perimeter resource.
service_perimeter_standard = var.service_perimeter_standard
}
module "project-transformation" {
source = "../../../../modules/project"
parent = var.root_node
billing_account = var.billing_account_id
prefix = var.prefix
name = var.project_names.transformation
services = [
"bigquery.googleapis.com",
"cloudbuild.googleapis.com",
"compute.googleapis.com",
"dataflow.googleapis.com",
"servicenetworking.googleapis.com",
"storage.googleapis.com",
"storage-component.googleapis.com",
]
iam_additive = {
"roles/owner" = [module.sa-services-main.iam_email]
}
service_encryption_key_ids = {
compute = [var.service_encryption_key_ids.global]
storage = [var.service_encryption_key_ids.multiregional]
dataflow = [var.service_encryption_key_ids.global]
}
# If used, remember to uncomment 'lifecycle' block in the
# modules/vpc-sc/google_access_context_manager_service_perimeter resource.
service_perimeter_standard = var.service_perimeter_standard
}
###############################################################################
# service accounts #
###############################################################################
module "sa-services-main" {
source = "../../../../modules/iam-service-account"
project_id = module.project-services.project_id
name = var.service_account_names.main
iam = var.admins != null ? { "roles/iam.serviceAccountTokenCreator" = var.admins } : {}
}

View File

@ -1,36 +0,0 @@
/**
* Copyright 2020 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
output "project_ids" {
description = "Project ids for created projects."
value = {
datamart = module.project-datamart.project_id
dwh = module.project-dwh.project_id
landing = module.project-landing.project_id
services = module.project-services.project_id
transformation = module.project-transformation.project_id
}
}
output "service_account" {
description = "Main service account."
value = module.sa-services-main.email
}
output "service_encryption_key_ids" {
description = "Cloud KMS encryption keys in {LOCATION => [KEY_URL]} format."
value = var.service_encryption_key_ids
}

View File

@ -1,82 +0,0 @@
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
variable "admins" {
description = "List of users allowed to impersonate the service account."
type = list(string)
default = null
}
variable "billing_account_id" {
description = "Billing account id."
type = string
}
variable "prefix" {
description = "Prefix used to generate project id and name."
type = string
default = null
}
variable "project_names" {
description = "Override this variable if you need non-standard names."
type = object({
datamart = string
dwh = string
landing = string
services = string
transformation = string
})
default = {
datamart = "datamart"
dwh = "datawh"
landing = "landing"
services = "services"
transformation = "transformation"
}
}
variable "root_node" {
description = "Parent folder or organization in 'folders/folder_id' or 'organizations/org_id' format."
type = string
}
variable "service_account_names" {
description = "Override this variable if you need non-standard names."
type = object({
main = string
})
default = {
main = "data-platform-main"
}
}
variable "service_encryption_key_ids" {
description = "Cloud KMS encryption key in {LOCATION => [KEY_URL]} format. Keys belong to existing project."
type = object({
multiregional = string
global = string
})
default = {
multiregional = null
global = null
}
}
variable "service_perimeter_standard" {
description = "VPC Service control standard perimeter name in the form of 'accessPolicies/ACCESS_POLICY_NAME/servicePerimeters/PERIMETER_NAME'. All projects will be added to the perimeter in enforced mode."
type = string
default = null
}

View File

@ -1,29 +0,0 @@
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
terraform {
required_version = ">= 1.0.0"
required_providers {
google = {
source = "hashicorp/google"
version = ">= 4.0.0"
}
google-beta = {
source = "hashicorp/google-beta"
version = ">= 4.0.0"
}
}
}

View File

@ -1,83 +0,0 @@
# Data Platform Foundations - Resources (Step 2)
This is the second step needed to deploy Data Platform Foundations, which creates the resources needed to store and process the data in the projects created in the [previous step](../01-environment/README.md). Please refer to the [top-level README](../README.md) for prerequisites and how to run the first step.
![Data Foundation - Phase 2](./diagram.png "High-level diagram")
The resources that will be created in each project are:
- Common
- Landing
- [x] GCS
- [x] Pub/Sub
- Orchestration & Transformation
- [x] Dataflow
- DWH
- [x] Bigquery (L0/1/2)
- [x] GCS
- Datamart
- [x] Bigquery (views/table)
- [x] GCS
- [ ] BigTable
## Running the example
In the previous step, we created the environment (projects and service account) which we are going to use in this step.
To create the resources, copy the output of the environment step (**project_ids**) and paste it into the `terraform.tfvars`:
- Specify your variables in a `terraform.tfvars`; you can use the output from the environment stage:
```tfm
project_ids = {
datamart = "datamart-project_id"
dwh = "dwh-project_id"
landing = "landing-project_id"
services = "services-project_id"
transformation = "transformation-project_id"
}
```
- The `providers.tf` file has been configured to impersonate the **main** service account (see the snippet after this list)
- To launch Terraform:
```bash
terraform plan
terraform apply
```
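For reference, the impersonation setup in `providers.tf` uses the main service account created in step 1 (default name `data-platform-main`):

```hcl
provider "google" {
  impersonate_service_account = "data-platform-main@${var.project_ids.services}.iam.gserviceaccount.com"
}

provider "google-beta" {
  impersonate_service_account = "data-platform-main@${var.project_ids.services}.iam.gserviceaccount.com"
}
```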
Once done testing, you can clean up resources by running `terraform destroy`.
### CMEK configuration
You can configure GCP resources to use existing CMEK keys by setting the `service_encryption_key_ids` variable. You need to specify both a `global` and a `multiregional` key.
<!-- BEGIN TFDOC -->
## Variables
| name | description | type | required | default |
|---|---|:---:|:---:|:---:|
| [project_ids](variables.tf#L108) | Project IDs. | <code title="object&#40;&#123;&#10; datamart &#61; string&#10; dwh &#61; string&#10; landing &#61; string&#10; services &#61; string&#10; transformation &#61; string&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | ✓ | |
| [admins](variables.tf#L16) | List of users allowed to impersonate the service account. | <code>list&#40;string&#41;</code> | | <code>null</code> |
| [datamart_bq_datasets](variables.tf#L22) | Datamart Bigquery datasets. | <code title="map&#40;object&#40;&#123;&#10; iam &#61; map&#40;list&#40;string&#41;&#41;&#10; location &#61; string&#10;&#125;&#41;&#41;">map&#40;object&#40;&#123;&#8230;&#125;&#41;&#41;</code> | | <code title="&#123;&#10; bq_datamart_dataset &#61; &#123;&#10; location &#61; &#34;EU&#34;&#10; iam &#61; &#123;&#10; &#125;&#10; &#125;&#10;&#125;">&#123;&#8230;&#125;</code> |
| [dwh_bq_datasets](variables.tf#L40) | DWH Bigquery datasets. | <code title="map&#40;object&#40;&#123;&#10; location &#61; string&#10; iam &#61; map&#40;list&#40;string&#41;&#41;&#10;&#125;&#41;&#41;">map&#40;object&#40;&#123;&#8230;&#125;&#41;&#41;</code> | | <code title="&#123;&#10; bq_raw_dataset &#61; &#123;&#10; iam &#61; &#123;&#125;&#10; location &#61; &#34;EU&#34;&#10; &#125;&#10;&#125;">&#123;&#8230;&#125;</code> |
| [landing_buckets](variables.tf#L54) | List of landing buckets to create. | <code title="map&#40;object&#40;&#123;&#10; location &#61; string&#10; name &#61; string&#10;&#125;&#41;&#41;">map&#40;object&#40;&#123;&#8230;&#125;&#41;&#41;</code> | | <code title="&#123;&#10; raw-data &#61; &#123;&#10; location &#61; &#34;EU&#34;&#10; name &#61; &#34;raw-data&#34;&#10; &#125;&#10; data-schema &#61; &#123;&#10; location &#61; &#34;EU&#34;&#10; name &#61; &#34;data-schema&#34;&#10; &#125;&#10;&#125;">&#123;&#8230;&#125;</code> |
| [landing_pubsub](variables.tf#L72) | List of landing pubsub topics and subscriptions to create. | <code title="map&#40;map&#40;object&#40;&#123;&#10; iam &#61; map&#40;list&#40;string&#41;&#41;&#10; labels &#61; map&#40;string&#41;&#10; options &#61; object&#40;&#123;&#10; ack_deadline_seconds &#61; number&#10; message_retention_duration &#61; number&#10; retain_acked_messages &#61; bool&#10; expiration_policy_ttl &#61; number&#10; &#125;&#41;&#10;&#125;&#41;&#41;&#41;">map&#40;map&#40;object&#40;&#123;&#8230;&#125;&#41;&#41;&#41;</code> | | <code title="&#123;&#10; landing-1 &#61; &#123;&#10; sub1 &#61; &#123;&#10; iam &#61; &#123;&#10; &#125;&#10; labels &#61; &#123;&#125;&#10; options &#61; null&#10; &#125;&#10; sub2 &#61; &#123;&#10; iam &#61; &#123;&#125;&#10; labels &#61; &#123;&#125;,&#10; options &#61; null&#10; &#125;,&#10; &#125;&#10;&#125;">&#123;&#8230;&#125;</code> |
| [landing_service_account](variables.tf#L102) | landing service accounts list. | <code>string</code> | | <code>&#34;sa-landing&#34;</code> |
| [service_account_names](variables.tf#L119) | Project service accounts list. | <code title="object&#40;&#123;&#10; datamart &#61; string&#10; dwh &#61; string&#10; landing &#61; string&#10; services &#61; string&#10; transformation &#61; string&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | | <code title="&#123;&#10; datamart &#61; &#34;sa-datamart&#34;&#10; dwh &#61; &#34;sa-datawh&#34;&#10; landing &#61; &#34;sa-landing&#34;&#10; services &#61; &#34;sa-services&#34;&#10; transformation &#61; &#34;sa-transformation&#34;&#10;&#125;">&#123;&#8230;&#125;</code> |
| [service_encryption_key_ids](variables.tf#L137) | Cloud KMS encryption key in {LOCATION => [KEY_URL]} format. Keys belong to existing project. | <code title="object&#40;&#123;&#10; multiregional &#61; string&#10; global &#61; string&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | | <code title="&#123;&#10; multiregional &#61; null&#10; global &#61; null&#10;&#125;">&#123;&#8230;&#125;</code> |
| [transformation_buckets](variables.tf#L149) | List of transformation buckets to create. | <code title="map&#40;object&#40;&#123;&#10; location &#61; string&#10; name &#61; string&#10;&#125;&#41;&#41;">map&#40;object&#40;&#123;&#8230;&#125;&#41;&#41;</code> | | <code title="&#123;&#10; temp &#61; &#123;&#10; location &#61; &#34;EU&#34;&#10; name &#61; &#34;temp&#34;&#10; &#125;,&#10; templates &#61; &#123;&#10; location &#61; &#34;EU&#34;&#10; name &#61; &#34;templates&#34;&#10; &#125;,&#10;&#125;">&#123;&#8230;&#125;</code> |
| [transformation_subnets](variables.tf#L167) | List of subnets to create in the transformation Project. | <code title="list&#40;object&#40;&#123;&#10; ip_cidr_range &#61; string&#10; name &#61; string&#10; region &#61; string&#10; secondary_ip_range &#61; map&#40;string&#41;&#10;&#125;&#41;&#41;">list&#40;object&#40;&#123;&#8230;&#125;&#41;&#41;</code> | | <code title="&#91;&#10; &#123;&#10; ip_cidr_range &#61; &#34;10.1.0.0&#47;20&#34;&#10; name &#61; &#34;transformation-subnet&#34;&#10; region &#61; &#34;europe-west3&#34;&#10; secondary_ip_range &#61; &#123;&#125;&#10; &#125;,&#10;&#93;">&#91;&#8230;&#93;</code> |
| [transformation_vpc_name](variables.tf#L185) | Name of the VPC created in the transformation Project. | <code>string</code> | | <code>&#34;transformation-vpc&#34;</code> |
## Outputs
| name | description | sensitive |
|---|---|:---:|
| [datamart-datasets](outputs.tf#L17) | List of bigquery datasets created for the datamart project. | |
| [dwh-datasets](outputs.tf#L24) | List of bigquery datasets created for the dwh project. | |
| [landing-buckets](outputs.tf#L29) | List of buckets created for the landing project. | |
| [landing-pubsub](outputs.tf#L34) | List of pubsub topics and subscriptions created for the landing project. | |
| [transformation-buckets](outputs.tf#L44) | List of buckets created for the transformation project. | |
| [transformation-vpc](outputs.tf#L49) | Transformation VPC details. | |
<!-- END TFDOC -->

Binary file not shown.

Before

Width:  |  Height:  |  Size: 470 KiB

View File

@ -1,211 +0,0 @@
/**
* Copyright 2020 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
###############################################################################
# IAM #
###############################################################################
module "datamart-sa" {
source = "../../../../modules/iam-service-account"
project_id = var.project_ids.datamart
name = var.service_account_names.datamart
iam_project_roles = {
"${var.project_ids.datamart}" = ["roles/editor"]
}
iam = var.admins != null ? { "roles/iam.serviceAccountTokenCreator" = var.admins } : {}
}
module "dwh-sa" {
source = "../../../../modules/iam-service-account"
project_id = var.project_ids.dwh
name = var.service_account_names.dwh
iam_project_roles = {
"${var.project_ids.dwh}" = ["roles/bigquery.admin"]
}
iam = var.admins != null ? { "roles/iam.serviceAccountTokenCreator" = var.admins } : {}
}
module "landing-sa" {
source = "../../../../modules/iam-service-account"
project_id = var.project_ids.landing
name = var.service_account_names.landing
iam_project_roles = {
"${var.project_ids.landing}" = [
"roles/pubsub.publisher",
"roles/storage.objectCreator"]
}
iam = var.admins != null ? { "roles/iam.serviceAccountTokenCreator" = var.admins } : {}
}
module "services-sa" {
source = "../../../../modules/iam-service-account"
project_id = var.project_ids.services
name = var.service_account_names.services
iam_project_roles = {
"${var.project_ids.services}" = ["roles/editor"]
}
iam = var.admins != null ? { "roles/iam.serviceAccountTokenCreator" = var.admins } : {}
}
module "transformation-sa" {
source = "../../../../modules/iam-service-account"
project_id = var.project_ids.transformation
name = var.service_account_names.transformation
iam_project_roles = {
"${var.project_ids.transformation}" = [
"roles/logging.logWriter",
"roles/monitoring.metricWriter",
"roles/dataflow.admin",
"roles/iam.serviceAccountUser",
"roles/bigquery.dataOwner",
"roles/bigquery.jobUser",
"roles/dataflow.worker",
"roles/bigquery.metadataViewer",
"roles/storage.objectViewer",
],
"${var.project_ids.landing}" = [
"roles/storage.objectViewer",
],
"${var.project_ids.dwh}" = [
"roles/bigquery.dataOwner",
"roles/bigquery.jobUser",
"roles/bigquery.metadataViewer",
]
}
iam = var.admins != null ? { "roles/iam.serviceAccountTokenCreator" = var.admins } : {}
}
###############################################################################
# GCS #
###############################################################################
module "landing-buckets" {
source = "../../../../modules/gcs"
for_each = var.landing_buckets
project_id = var.project_ids.landing
prefix = var.project_ids.landing
name = each.value.name
location = each.value.location
iam = {
"roles/storage.objectCreator" = [module.landing-sa.iam_email]
"roles/storage.admin" = [module.transformation-sa.iam_email]
}
encryption_key = var.service_encryption_key_ids.multiregional
}
module "transformation-buckets" {
source = "../../../../modules/gcs"
for_each = var.transformation_buckets
project_id = var.project_ids.transformation
prefix = var.project_ids.transformation
name = each.value.name
location = each.value.location
iam = {
"roles/storage.admin" = [module.transformation-sa.iam_email]
}
encryption_key = var.service_encryption_key_ids.multiregional
}
###############################################################################
# Bigquery #
###############################################################################
module "datamart-bq" {
source = "../../../../modules/bigquery-dataset"
for_each = var.datamart_bq_datasets
project_id = var.project_ids.datamart
id = each.key
location = each.value.location
iam = {
for k, v in each.value.iam : k => (
k == "roles/bigquery.dataOwner"
? concat(v, [module.datamart-sa.iam_email])
: v
)
}
encryption_key = var.service_encryption_key_ids.multiregional
}
module "dwh-bq" {
source = "../../../../modules/bigquery-dataset"
for_each = var.dwh_bq_datasets
project_id = var.project_ids.dwh
id = each.key
location = each.value.location
iam = {
for k, v in each.value.iam : k => (
k == "roles/bigquery.dataOwner"
? concat(v, [module.dwh-sa.iam_email])
: v
)
}
encryption_key = var.service_encryption_key_ids.multiregional
}
###############################################################################
# Network #
###############################################################################
module "vpc-transformation" {
source = "../../../../modules/net-vpc"
project_id = var.project_ids.transformation
name = var.transformation_vpc_name
subnets = var.transformation_subnets
}
module "firewall" {
source = "../../../../modules/net-vpc-firewall"
project_id = var.project_ids.transformation
network = module.vpc-transformation.name
admin_ranges = []
http_source_ranges = []
https_source_ranges = []
ssh_source_ranges = []
custom_rules = {
iap-svc = {
description = "Dataflow service."
direction = "INGRESS"
action = "allow"
sources = ["dataflow"]
targets = ["dataflow"]
ranges = []
use_service_accounts = false
rules = [{ protocol = "tcp", ports = ["12345-12346"] }]
extra_attributes = {}
}
}
}
###############################################################################
# Pub/Sub #
###############################################################################
module "landing-pubsub" {
source = "../../../../modules/pubsub"
for_each = var.landing_pubsub
project_id = var.project_ids.landing
name = each.key
subscriptions = {
for k, v in each.value : k => { labels = v.labels, options = v.options }
}
subscription_iam = {
for k, v in each.value : k => merge(v.iam, {
"roles/pubsub.subscriber" = [module.transformation-sa.iam_email]
})
}
kms_key = var.service_encryption_key_ids.global
}

View File

@ -1,60 +0,0 @@
/**
* Copyright 2020 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
output "datamart-datasets" {
description = "List of bigquery datasets created for the datamart project."
value = [
for k, datasets in module.datamart-bq : datasets.dataset_id
]
}
output "dwh-datasets" {
description = "List of bigquery datasets created for the dwh project."
value = [for k, datasets in module.dwh-bq : datasets.dataset_id]
}
output "landing-buckets" {
description = "List of buckets created for the landing project."
value = [for k, bucket in module.landing-buckets : bucket.name]
}
output "landing-pubsub" {
description = "List of pubsub topics and subscriptions created for the landing project."
value = {
for t in module.landing-pubsub : t.topic.name => {
id = t.topic.id
subscriptions = { for s in t.subscriptions : s.name => s.id }
}
}
}
output "transformation-buckets" {
description = "List of buckets created for the transformation project."
value = [for k, bucket in module.transformation-buckets : bucket.name]
}
output "transformation-vpc" {
description = "Transformation VPC details."
value = {
name = module.vpc-transformation.name
subnets = {
for k, s in module.vpc-transformation.subnets : k => {
ip_cidr_range = s.ip_cidr_range
region = s.region
}
}
}
}

View File

@ -1,23 +0,0 @@
/**
* Copyright 2022 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
provider "google" {
impersonate_service_account = "data-platform-main@${var.project_ids.services}.iam.gserviceaccount.com"
}
provider "google-beta" {
impersonate_service_account = "data-platform-main@${var.project_ids.services}.iam.gserviceaccount.com"
}

View File

@ -1,189 +0,0 @@
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
variable "admins" {
description = "List of users allowed to impersonate the service account."
type = list(string)
default = null
}
variable "datamart_bq_datasets" {
description = "Datamart Bigquery datasets."
type = map(object({
iam = map(list(string))
location = string
}))
default = {
bq_datamart_dataset = {
location = "EU"
iam = {
# "roles/bigquery.dataOwner" = []
# "roles/bigquery.dataEditor" = []
# "roles/bigquery.dataViewer" = []
}
}
}
}
variable "dwh_bq_datasets" {
description = "DWH Bigquery datasets."
type = map(object({
location = string
iam = map(list(string))
}))
default = {
bq_raw_dataset = {
iam = {}
location = "EU"
}
}
}
variable "landing_buckets" {
description = "List of landing buckets to create."
type = map(object({
location = string
name = string
}))
default = {
raw-data = {
location = "EU"
name = "raw-data"
}
data-schema = {
location = "EU"
name = "data-schema"
}
}
}
variable "landing_pubsub" {
description = "List of landing pubsub topics and subscriptions to create."
type = map(map(object({
iam = map(list(string))
labels = map(string)
options = object({
ack_deadline_seconds = number
message_retention_duration = number
retain_acked_messages = bool
expiration_policy_ttl = number
})
})))
default = {
landing-1 = {
sub1 = {
iam = {
# "roles/pubsub.subscriber" = []
}
labels = {}
options = null
}
sub2 = {
iam = {}
labels = {},
options = null
},
}
}
}
variable "landing_service_account" {
description = "landing service accounts list."
type = string
default = "sa-landing"
}
variable "project_ids" {
description = "Project IDs."
type = object({
datamart = string
dwh = string
landing = string
services = string
transformation = string
})
}
variable "service_account_names" {
description = "Project service accounts list."
type = object({
datamart = string
dwh = string
landing = string
services = string
transformation = string
})
default = {
datamart = "sa-datamart"
dwh = "sa-datawh"
landing = "sa-landing"
services = "sa-services"
transformation = "sa-transformation"
}
}
variable "service_encryption_key_ids" {
description = "Cloud KMS encryption key in {LOCATION => [KEY_URL]} format. Keys belong to existing project."
type = object({
multiregional = string
global = string
})
default = {
multiregional = null
global = null
}
}
variable "transformation_buckets" {
description = "List of transformation buckets to create."
type = map(object({
location = string
name = string
}))
default = {
temp = {
location = "EU"
name = "temp"
},
templates = {
location = "EU"
name = "templates"
},
}
}
variable "transformation_subnets" {
description = "List of subnets to create in the transformation Project."
type = list(object({
ip_cidr_range = string
name = string
region = string
secondary_ip_range = map(string)
}))
default = [
{
ip_cidr_range = "10.1.0.0/20"
name = "transformation-subnet"
region = "europe-west3"
secondary_ip_range = {}
},
]
}
variable "transformation_vpc_name" {
description = "Name of the VPC created in the transformation Project."
type = string
default = "transformation-vpc"
}

View File

@ -1,29 +0,0 @@
# Copyright 2022 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
terraform {
required_version = ">= 1.0.0"
required_providers {
google = {
source = "hashicorp/google"
version = ">= 4.0.0"
}
google-beta = {
source = "hashicorp/google-beta"
version = ">= 4.0.0"
}
}
}

View File

@ -1,8 +0,0 @@
# Manual pipeline Example
Once you have deployed the projects ([step 1](../01-environment/README.md)) and resources ([step 2](../02-resources/README.md)), you can use them to run your data pipelines.
Here we demo two pipelines:
* [GCS to Bigquery](./gcs_to_bigquery.md)
* [PubSub to Bigquery](./pubsub_to_bigquery.md)

View File

@ -1,140 +0,0 @@
# Manual pipeline Example: GCS to Bigquery
In this example we will publish person messages in the following format:
```bash
name,surname,1617898199
```
A Dataflow pipeline will read those messages and import them into a BigQuery table in the DWH project.
[TODO] An authorized view will be created in the datamart project to expose the table.
[TODO] Further automation is expected in the future.
## Set up the env vars
```bash
export DWH_PROJECT_ID=**dwh_project_id**
export LANDING_PROJECT_ID=**landing_project_id**
export TRANSFORMATION_PROJECT_ID=**transformation_project_id**
```
## Create BQ table
These steps should be performed as the DWH service account.
Run the following command to create a table:
```bash
gcloud --impersonate-service-account=sa-datawh@$DWH_PROJECT_ID.iam.gserviceaccount.com \
alpha bq tables create person \
--project=$DWH_PROJECT_ID --dataset=bq_raw_dataset \
--description "This is a Test Person table" \
--schema name=STRING,surname=STRING,timestamp=TIMESTAMP
```
## Produce CSV data file, JSON schema file and UDF JS file
These steps should be performed as the landing service account.
Let's now create a series of messages we can use to import:
```bash
for i in {0..10}
do
echo "Lorenzo,Caggioni,$(date +%s)" >> person.csv
done
```
and copy the file to the GCS bucket:
```bash
gsutil -i sa-landing@$LANDING_PROJECT_ID.iam.gserviceaccount.com cp person.csv gs://$LANDING_PROJECT_ID-eu-raw-data
```
Let's create the data JSON schema:
```bash
cat <<'EOF' >> person_schema.json
{
"BigQuery Schema": [
{
"name": "name",
"type": "STRING"
},
{
"name": "surname",
"type": "STRING"
},
{
"name": "timestamp",
"type": "TIMESTAMP"
}
]
}
EOF
```
and copy the file to the GCS bucket:
```bash
gsutil -i sa-landing@$LANDING_PROJECT_ID.iam.gserviceaccount.com cp person_schema.json gs://$LANDING_PROJECT_ID-eu-data-schema
```
Let's create the data UDF function to transform message data:
```bash
cat <<'EOF' >> person_udf.js
function transform(line) {
var values = line.split(',');
var obj = new Object();
obj.name = values[0];
obj.surname = values[1];
obj.timestamp = values[2];
var jsonString = JSON.stringify(obj);
return jsonString;
}
EOF
```
and copy the file to the GCS bucket:
```bash
gsutil -i sa-landing@$LANDING_PROJECT_ID.iam.gserviceaccount.com cp person_udf.js gs://$LANDING_PROJECT_ID-eu-data-schema
```
If you want to check the files copied to GCS, you can use the transformation service account:
```bash
gsutil -i sa-transformation@$TRANSFORMATION_PROJECT_ID.iam.gserviceaccount.com ls gs://$LANDING_PROJECT_ID-eu-raw-data
gsutil -i sa-transformation@$TRANSFORMATION_PROJECT_ID.iam.gserviceaccount.com ls gs://$LANDING_PROJECT_ID-eu-data-schema
```
## Dataflow
These steps should be performed as the transformation service account.
Let's then start a Dataflow batch pipeline from a Google-provided template, using internal IPs only, the network and subnetwork created earlier, the appropriate service account, and the required parameters:
```bash
gcloud --impersonate-service-account=sa-transformation@$TRANSFORMATION_PROJECT_ID.iam.gserviceaccount.com dataflow jobs run test_batch_01 \
--gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
--project $TRANSFORMATION_PROJECT_ID \
--region europe-west3 \
--disable-public-ips \
--network transformation-vpc \
--subnetwork regions/europe-west3/subnetworks/transformation-subnet \
--staging-location gs://$TRANSFORMATION_PROJECT_ID-eu-temp \
--service-account-email sa-transformation@$TRANSFORMATION_PROJECT_ID.iam.gserviceaccount.com \
--parameters \
javascriptTextTransformFunctionName=transform,\
JSONPath=gs://$LANDING_PROJECT_ID-eu-data-schema/person_schema.json,\
javascriptTextTransformGcsPath=gs://$LANDING_PROJECT_ID-eu-data-schema/person_udf.js,\
inputFilePattern=gs://$LANDING_PROJECT_ID-eu-raw-data/person.csv,\
outputTable=$DWH_PROJECT_ID:bq_raw_dataset.person,\
bigQueryLoadingTemporaryDirectory=gs://$TRANSFORMATION_PROJECT_ID-eu-temp
```

View File

@ -1,75 +0,0 @@
# Manual pipeline Example: PubSub to Bigquery
In this example we will publish person messages in the following format:
```txt
name: Name
surname: Surname
timestamp: 1617898199
```
A Dataflow pipeline will read those messages and import them into a BigQuery table in the DWH project.
An authorized view will be created in the datamart project to expose the table.
[TODO] Further automation is expected in the future.
## Set up the env vars
```bash
export DWH_PROJECT_ID=**dwh_project_id**
export LANDING_PROJECT_ID=**landing_project_id**
export TRANSFORMATION_PROJECT_ID=**transformation_project_id**
```
## Create BQ table
These steps should be performed as the DWH service account.
Run the following command to create a table:
```bash
gcloud --impersonate-service-account=sa-datawh@$DWH_PROJECT_ID.iam.gserviceaccount.com \
alpha bq tables create person \
--project=$DWH_PROJECT_ID --dataset=bq_raw_dataset \
--description "This is a Test Person table" \
--schema name=STRING,surname=STRING,timestamp=TIMESTAMP
```
## Produce PubSub messages
These steps should be performed as the landing service account.
Let's now create a series of messages we can use to import:
```bash
for i in {0..10}
do
gcloud --impersonate-service-account=sa-landing@$LANDING_PROJECT_ID.iam.gserviceaccount.com pubsub topics publish projects/$LANDING_PROJECT_ID/topics/landing-1 --message="{\"name\": \"Lorenzo\", \"surname\": \"Caggioni\", \"timestamp\": \"$(date +%s)\"}"
done
```
If you want to check the messages published, you can use the transformation service account and read a message (the message won't be acked and will stay in the subscription):
```bash
gcloud --impersonate-service-account=sa-transformation@$TRANSFORMATION_PROJECT_ID.iam.gserviceaccount.com pubsub subscriptions pull projects/$LANDING_PROJECT_ID/subscriptions/sub1
```
## Dataflow
These steps should be performed as the transformation service account.
Let's then start a Dataflow streaming pipeline from a Google-provided template, using internal IPs only, the network and subnetwork created earlier, the appropriate service account, and the required parameters:
```bash
gcloud --impersonate-service-account=sa-transformation@$TRANSFORMATION_PROJECT_ID.iam.gserviceaccount.com dataflow jobs run test_streaming01 \
--gcs-location gs://dataflow-templates/latest/PubSub_Subscription_to_BigQuery \
--project $TRANSFORMATION_PROJECT_ID \
--region europe-west3 \
--disable-public-ips \
--network transformation-vpc \
--subnetwork regions/europe-west3/subnetworks/transformation-subnet \
--staging-location gs://$TRANSFORMATION_PROJECT_ID-eu-temp \
--service-account-email sa-transformation@$TRANSFORMATION_PROJECT_ID.iam.gserviceaccount.com \
--parameters \
inputSubscription=projects/$LANDING_PROJECT_ID/subscriptions/sub1,\
outputTableSpec=$DWH_PROJECT_ID:bq_raw_dataset.person
```

View File

@ -1,26 +0,0 @@
{
"schema": {
"fields": [
{
"mode": "NULLABLE",
"name": "name",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "surname",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "age",
"type": "INTEGER"
},
{
"mode": "NULLABLE",
"name": "boolean_val",
"type": "BOOLEAN"
}
]
}
}

View File

@ -1,61 +1,251 @@
# Data Foundation Platform
# Data Platform
The goal of this example is to build a robust and flexible Data Foundation on GCP, providing opinionated defaults while still allowing customers to quickly and reliably build and scale out additional data pipelines.
This module implements an opinionated Data Platform (DP) architecture that creates and sets up projects and related resources that compose an end-to-end data environment.
The example is composed of three separate provisioning workflows, which are designed to be plugged together to create end-to-end Data Foundations that support multiple data pipelines on top.
The code is intentionally simple, as it's intended to provide a generic initial setup and then allow easy customizations to complete the implementation of the intended design.
1. **[Environment Setup](./01-environment/)**
*(once per environment)*
* projects
* VPC configuration
* Composer environment and identity
* shared buckets and datasets
1. **[Data Source Setup](./02-resources)**
*(once per data source)*
* landing and archive bucket
* internal and external identities
* domain specific datasets
1. **[Pipeline Setup](./03-pipeline)**
*(once per pipeline)*
* pipeline-specific tables and views
* pipeline code
* Composer DAG
The following diagram is a high-level reference of the resources created and managed here:
The resulting GCP architecture is outlined in this diagram
![Target architecture](./02-resources/diagram.png)
![Data Platform architecture overview](./images/overview_diagram.png "Data Platform architecture overview")
A demo pipeline is also part of this example: it can be built and run on top of the foundational infrastructure to quickly verify or test the setup.
A demo pipeline is also part of this example: it can be built and run on top of the foundational infrastructure to verify or test the setup quickly.
## Prerequisites
## Design overview and choices
In order to bring up this example, you will need
Despite its simplicity, this stage implements the basics of a design that we've seen working well for various customers.
The approach adapts to different high-level requirements:
- boundaries for each step
- clearly defined actors
- least privilege principle
- reliance on service account impersonation
The code in this example doesn't address Organization-level configurations (Organization policy, VPC-SC, centralized logs). We expect those to be managed by automation stages external to this script like those in [FAST](../../../fast).
### Project structure
The DP is designed to rely on several projects, one project per data stage. The stages identified are:
- landing
- load
- data lake
- orchestration
- transformation
- exposure
This separation into projects allows adhering to the least-privilege principle by using project-level roles.
The script will create the following projects:
- **Landing** Used to store temporary data. Data is pushed to Cloud Storage, BigQuery, or Cloud PubSub. Resources are configured with a customizable lifecycle policy.
- **Load** Used to load data from landing to data lake. The load is made with minimal to zero transformation logic (mainly `cast`). Anonymization or tokenization of Personally Identifiable Information (PII) can be implemented here or in the transformation stage, depending on your requirements. The use of [Cloud Dataflow templates](https://cloud.google.com/dataflow/docs/concepts/dataflow-templates) is recommended.
- **Data Lake** Several projects distributed across 3 separate layers, to host progressively processed and refined data:
- **L0 - Raw data** Structured Data, stored in relevant formats: structured data stored in BigQuery, unstructured data stored on Cloud Storage with additional metadata stored in BigQuery (for example pictures stored in Cloud Storage and analysis of the images for Cloud Vision API stored in BigQuery).
- **L1 - Cleansed, aggregated and standardized data**
- **L2 - Curated layer**
- **Playground** Temporary tables that Data Analysts may use to perform R&D on data available in other Data Lake layers.
- **Orchestration** Used to host Cloud Composer, which orchestrates all tasks that move data across layers.
- **Transformation** Used to move data between Data Lake layers. We strongly suggest relying on BigQuery Engine to perform the transformations. If BigQuery doesn't have the features needed to perform your transformations, you can use Cloud Dataflow with [Cloud Dataflow templates](https://cloud.google.com/dataflow/docs/concepts/dataflow-templates). This stage can also optionally anonymize or tokenize PII.
- **Exposure** Used to host resources that share processed data with external systems. Depending on the access pattern, data can be presented via Cloud SQL, BigQuery, or Bigtable. For BigQuery data, we strongly suggest relying on [Authorized views](https://cloud.google.com/bigquery/docs/authorized-views).
### Roles
We assign roles on resources at the project level, granting the appropriate role via groups for humans and individual principals for service accounts, according to best practices.
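As a minimal sketch of this pattern (project, group, and role names are illustrative), the `project` module used in this example accepts additive IAM bindings keyed by role:

```hcl
# Illustrative only: bind a human group to a project-level role additively.
module "project-dtl-2" {
  source          = "../../../modules/project"
  name            = "dtl-2"
  parent          = "folders/123456789012"
  billing_account = "111111-222222-333333"
  iam_additive = {
    "roles/bigquery.dataViewer" = ["group:data-analysts@example.com"]
  }
}
```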
### Service accounts
Service account creation follows the least privilege principle: each service account performs a single task that requires access to a defined set of resources. For example, the Cloud Dataflow service account only has access to the landing project and the data lake L0 project.
Using service account keys within a data pipeline exposes you to several security risks deriving from a credential leak. This example shows how to leverage service account impersonation to avoid the need to create keys.
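A minimal sketch of how impersonation is enabled (names are illustrative): human groups are granted `roles/iam.serviceAccountTokenCreator` on the service account instead of receiving exported keys, following the same pattern used by the `iam-service-account` module calls in this example:

```hcl
# Illustrative only: let the data engineers group impersonate this service
# account; no service account key is ever created or downloaded.
module "load-sa" {
  source     = "../../../modules/iam-service-account"
  project_id = "my-load-project-id"
  name       = "load-df-0"
  iam = {
    "roles/iam.serviceAccountTokenCreator" = ["group:data-engineers@example.com"]
  }
}
```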
### Groups
We use three groups to control access to resources:
- *Data Engineers*. They handle and run the Data Hub, with read access to all resources in order to troubleshoot possible issues with pipelines. This team can also impersonate any service account.
- *Data Analysts*. They perform analysis on datasets, with read access to the Data Lake L2 project and BigQuery read/write access to the playground project.
- *Data Security*. They handle security configurations related to the Data Hub.
### Virtual Private Cloud (VPC) design
As is often the case in real-world configurations, this example accepts as input an existing [Shared-VPC](https://cloud.google.com/vpc/docs/shared-vpc) via the `network_config` variable.
If the `network_config` variable is not provided, one VPC will be created in each project that supports network resources (load, transformation and orchestration).
### IP ranges and subnetting
To deploy this example with self-managed VPCs you need the following ranges:
- one /24 for the load project VPC subnet used for Cloud Dataflow workers
- one /24 for the transformation VPC subnet used for Cloud Dataflow workers
- one /24 range for the orchestration VPC subnet used for Composer workers
- one /22 and one /24 range for the secondary ranges associated with the orchestration VPC subnet
If you are using Shared VPC, you need one subnet with one /22 and one /24 secondary range defined for Composer pods and services.
In both VPC scenarios, you also need these ranges for Composer:
- one /24 for Cloud SQL
- one /28 for the GKE control plane
- one /28 for the web server
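As a reference, these Composer ranges map onto the `composer_config` variable; its defaults (see the variables table below) encode one possible layout:

```hcl
composer_config = {
  ip_range_cloudsql   = "10.20.10.0/24"
  ip_range_gke_master = "10.20.11.0/28"
  ip_range_web_server = "10.20.11.16/28"
  policy_boolean      = null
  region              = "europe-west1"
  secondary_ip_range = {
    pods     = "10.10.8.0/22"
    services = "10.10.12.0/24"
  }
}
```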
### Resource naming conventions
Resources in the script use the following acronyms:
- `lnd` for `landing`
- `lod` for `load`
- `orc` for `orchestration`
- `trf` for `transformation`
- `dtl` for `Data Lake`
- `cmn` for `common`
- `plg` for `playground`
- a two-letter acronym for GCP products, for example `bq` for `BigQuery`, `df` for `Cloud Dataflow`, and so on
Resources follow the naming convention described below.
- `prefix-layer` for projects
- `prefix-layer[2]-gcp-product[2]-counter` for services and service accounts
### Encryption
We suggest a centralized approach to key management, where Security is the only team that can access encryption material, and keyrings and keys are managed in a project external to the DP.
![Centralized Cloud Key Management high-level diagram](./images/kms_diagram.png "Centralized Cloud Key Management high-level diagram")
To configure the use of Cloud KMS on resources, specify the key IDs in the `service_encryption_keys` variable. Key locations should match resource locations. Example:
```hcl
service_encryption_keys = {
bq = "KEY_URL_MULTIREGIONAL"
composer = "KEY_URL_REGIONAL"
dataflow = "KEY_URL_REGIONAL"
storage = "KEY_URL_MULTIREGIONAL"
pubsub = "KEY_URL_MULTIREGIONAL"
}
```
This step is optional and depends on customer policies and security best practices.
## Data Anonymization
We suggest using Cloud Data Loss Prevention to identify/mask/tokenize your confidential data.
While implementing a Data Loss Prevention strategy is out of scope for this example, we enable the service in two different projects so that [Cloud Data Loss Prevention templates](https://cloud.google.com/dlp/docs/concepts-templates) can be configured in one of two ways:
- during the ingestion phase, from Dataflow
- during the transformation phase, from [BigQuery](https://cloud.google.com/bigquery/docs/scan-with-dlp) or [Cloud Dataflow](https://cloud.google.com/architecture/running-automated-dataflow-pipeline-de-identify-pii-dataset)
Cloud Data Loss Prevention resources and templates should be stored in the security project:
![Centralized Cloud Data Loss Prevention high-level diagram](./images/dlp_diagram.png "Centralized Cloud Data Loss Prevention high-level diagram")
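As an illustrative sketch only (not part of this example's code), an inspection template could be hosted in the security project using the `google_data_loss_prevention_inspect_template` Terraform resource; the project name and info types below are assumptions:

```hcl
# Hypothetical sketch: a DLP inspection template hosted in the security project.
resource "google_data_loss_prevention_inspect_template" "pii" {
  parent       = "projects/my-security-project"
  display_name = "pii-inspect-template"
  description  = "Inspect template for common PII info types."
  inspect_config {
    info_types {
      name = "PERSON_NAME"
    }
    info_types {
      name = "EMAIL_ADDRESS"
    }
  }
}
```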
## How to run this script
To deploy this example on your GCP organization, you will need
- a folder or organization where new projects will be created
- a billing account that will be associated to new projects
- an identity (user or service account) with owner permissions on the folder or org, and billing user permissions on the billing account
- a billing account that will be associated with the new projects
## Bringing up the platform
The DP is meant to be executed by a Service Account (or a regular user) having this minimal set of permissions:
[![Open in Cloud Shell](https://gstatic.com/cloudssh/images/open-btn.svg)](https://ssh.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https%3A%2F%2Fgithub.com%2Fterraform-google-modules%2Fcloud-foundation-fabric.git&cloudshell_open_in_editor=README.md&cloudshell_workspace=examples%2Fdata-solutions%2Fdata-platform-foundations)
- **Org level**:
- `"compute.organizations.enableXpnResource"`
- `"compute.organizations.disableXpnResource"`
- `"compute.subnetworks.setIamPolicy"`
- **Folder level**:
- `"roles/logging.admin"`
- `"roles/owner"`
- `"roles/resourcemanager.folderAdmin"`
- `"roles/resourcemanager.projectCreator"`
- **Cloud Key Management Keys** (if Cloud Key Management keys are configured):
- `"roles/cloudkms.admin"` or Permissions: `cloudkms.cryptoKeys.getIamPolicy`, `cloudkms.cryptoKeys.list`, `cloudkms.cryptoKeys.setIamPolicy`
- **On the host project** for the Shared VPC/s
- `"roles/browser"`
- `"roles/compute.viewer"`
- `"roles/dns.admin"`
The end-to-end example is composed of two foundational steps and one optional step:
## Variable configuration
1. [Environment setup](./01-environment/)
1. [Data source setup](./02-resources/)
1. (Optional) [Pipeline setup](./03-pipeline/)
There are three sets of variables you will need to fill in:
The environment setup is designed to manage a single environment. Various strategies like workspaces, branching, or even separate clones can be used to support multiple environments.
```hcl
prefix = "myco"
project_create = {
parent = "folders/123456789012"
billing_account_id = "111111-222222-333333"
}
organization = {
domain = "domain.com"
}
```
## TODO
For finer details, check the variables in [`variables.tf`](./variables.tf) and update them according to the desired configuration.
| Description | Priority (1:High - 5:Low ) | Status | Remarks |
|-------------|----------|:------:|---------|
| DLP best practices in the pipeline | 2 | Not Started | |
| Add Composer with a static DAG running the example | 3 | Not Started | |
| Integrate [CI/CD composer data processing workflow framework](https://github.com/jaketf/ci-cd-for-data-processing-workflow) | 3 | Not Started | |
| Schema changes, how to handle | 4 | Not Started | |
| Data lineage | 4 | Not Started | |
| Data quality checks | 4 | Not Started | |
| Shared-VPC | 5 | Not Started | |
| Logging & monitoring | TBD | Not Started | |
| Orchestration for the ingestion pipeline (just in the readme) | TBD | Not Started | |
## Customizations
### Create Cloud Key Management keys as part of the DP
To create Cloud Key Management keys in the DP you can uncomment the Cloud Key Management resources configured in the [`06-common.tf`](./06-common.tf) file and update the Cloud Key Management key references in `local.service_encryption_keys.*` to point to the local resources created.
### Assign roles at BQ Dataset level
To handle multiple groups of `data-analysts` accessing the same Data Lake layer projects but only the datasets belonging to a specific group, you may want to assign roles at the BigQuery dataset level instead of at the project level.
To do this, you need to remove the project-level IAM binding for the `data-analysts` group and grant roles at the BigQuery dataset level using the `iam` variable on the `bigquery-dataset` modules.
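A minimal sketch of what this could look like, assuming the Data Lake datasets are created via the Fabric `bigquery-dataset` module; the module path, dataset id and group name are illustrative, so check the module's README for its full interface:

```hcl
module "datalake-l2-team-a" {
  source     = "../../../modules/bigquery-dataset"
  project_id = "myco-dtl-2"     # placeholder project id
  id         = "dtl_2_team_a"   # placeholder dataset id
  # grant dataset-level access to a single analysts group instead of
  # binding the role for all analysts at project level
  iam = {
    "roles/bigquery.dataViewer" = [
      "group:team-a-data-analysts@example.com"
    ]
  }
}
```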
## Demo pipeline
The application layer is out of scope of this script, but a Cloud Composer DAG is provided as a demo to move data from the `landing` area to the `DataLake L2` dataset.
Just follow the commands you find in the `demo_commands` Terraform output, go to the Cloud Composer Airflow UI, and run the `data_pipeline_dag`.
Description of commands:
- 01: copy sample data to the `landing` Cloud Storage bucket impersonating the `load` service account.
- 02: copy the sample data structure definition to the `orchestration` Cloud Storage bucket impersonating the `orchestration` service account.
- 03: copy the Cloud Composer DAG to the Cloud Composer Storage bucket impersonating the `orchestration` service account.
- 04: open the Cloud Composer Airflow UI and run the imported DAG.
- 05: run the BigQuery query to see the results.
<!-- BEGIN TFDOC -->
## Variables
| name | description | type | required | default |
|---|---|:---:|:---:|:---:|
| [organization](variables.tf#L88) | Organization details. | <code title="object&#40;&#123;&#10; domain &#61; string&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | ✓ | |
| [prefix](variables.tf#L95) | Unique prefix used for resource names. Not used for projects if 'project_create' is null. | <code>string</code> | ✓ | |
| [composer_config](variables.tf#L17) | | <code title="object&#40;&#123;&#10; ip_range_cloudsql &#61; string&#10; ip_range_gke_master &#61; string&#10; ip_range_web_server &#61; string&#10; policy_boolean &#61; map&#40;bool&#41;&#10; region &#61; string&#10; secondary_ip_range &#61; object&#40;&#123;&#10; pods &#61; string&#10; services &#61; string&#10; &#125;&#41;&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | | <code title="&#123;&#10; ip_range_cloudsql &#61; &#34;10.20.10.0&#47;24&#34;&#10; ip_range_gke_master &#61; &#34;10.20.11.0&#47;28&#34;&#10; ip_range_web_server &#61; &#34;10.20.11.16&#47;28&#34;&#10; policy_boolean &#61; null&#10; region &#61; &#34;europe-west1&#34;&#10; secondary_ip_range &#61; &#123;&#10; pods &#61; &#34;10.10.8.0&#47;22&#34;&#10; services &#61; &#34;10.10.12.0&#47;24&#34;&#10; &#125;&#10;&#125;">&#123;&#8230;&#125;</code> |
| [data_force_destroy](variables.tf#L42) | Flag to set 'force_destroy' on data services like BigQuery or Cloud Storage. | <code>bool</code> | | <code>false</code> |
| [groups](variables.tf#L48) | Groups. | <code>map&#40;string&#41;</code> | | <code title="&#123;&#10; data-analysts &#61; &#34;gcp-data-analysts&#34;&#10; data-engineers &#61; &#34;gcp-data-engineers&#34;&#10; data-security &#61; &#34;gcp-data-security&#34;&#10;&#125;">&#123;&#8230;&#125;</code> |
| [location_config](variables.tf#L148) | Locations where resources will be deployed. Map to configure region and multiregion specs. | <code title="object&#40;&#123;&#10; region &#61; string&#10; multi_region &#61; string&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | | <code title="&#123;&#10; region &#61; &#34;europe-west1&#34;&#10; multi_region &#61; &#34;eu&#34;&#10;&#125;">&#123;&#8230;&#125;</code> |
| [network_config](variables.tf#L58) | Network configurations to use. Specify a shared VPC to use, if null networks will be created in projects. | <code title="object&#40;&#123;&#10; enable_cloud_nat &#61; bool&#10; host_project &#61; string&#10; network &#61; string&#10; vpc_subnet_range &#61; object&#40;&#123;&#10; load &#61; string&#10; transformation &#61; string&#10; orchestration &#61; string&#10; &#125;&#41;&#10; vpc_subnet_self_link &#61; object&#40;&#123;&#10; load &#61; string&#10; transformation &#61; string&#10; orchestration &#61; string&#10; &#125;&#41;&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | | <code title="&#123;&#10; enable_cloud_nat &#61; false&#10; host_project &#61; null&#10; network &#61; null&#10; vpc_subnet_range &#61; &#123;&#10; load &#61; &#34;10.10.0.0&#47;24&#34;&#10; transformation &#61; &#34;10.10.0.0&#47;24&#34;&#10; orchestration &#61; &#34;10.10.0.0&#47;24&#34;&#10; &#125;&#10; vpc_subnet_self_link &#61; null&#10;&#125;">&#123;&#8230;&#125;</code> |
| [project_create](variables.tf#L100) | Provide values if project creation is needed, uses existing project if null. Parent is in 'folders/nnn' or 'organizations/nnn' format. | <code title="object&#40;&#123;&#10; billing_account_id &#61; string&#10; parent &#61; string&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | | <code>null</code> |
| [project_id](variables.tf#L109) | Project id, references existing project if `project_create` is null. | <code title="object&#40;&#123;&#10; landing &#61; string&#10; load &#61; string&#10; orchestration &#61; string&#10; trasformation &#61; string&#10; datalake-l0 &#61; string&#10; datalake-l1 &#61; string&#10; datalake-l2 &#61; string&#10; datalake-playground &#61; string&#10; common &#61; string&#10; exposure &#61; string&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | | <code title="&#123;&#10; landing &#61; &#34;lnd&#34;&#10; load &#61; &#34;lod&#34;&#10; orchestration &#61; &#34;orc&#34;&#10; trasformation &#61; &#34;trf&#34;&#10; datalake-l0 &#61; &#34;dtl-0&#34;&#10; datalake-l1 &#61; &#34;dtl-1&#34;&#10; datalake-l2 &#61; &#34;dtl-2&#34;&#10; datalake-playground &#61; &#34;dtl-plg&#34;&#10; common &#61; &#34;cmn&#34;&#10; exposure &#61; &#34;exp&#34;&#10;&#125;">&#123;&#8230;&#125;</code> |
| [project_services](variables.tf#L137) | List of core services enabled on all projects. | <code>list&#40;string&#41;</code> | | <code title="&#91;&#10; &#34;cloudresourcemanager.googleapis.com&#34;,&#10; &#34;iam.googleapis.com&#34;,&#10; &#34;serviceusage.googleapis.com&#34;,&#10; &#34;stackdriver.googleapis.com&#34;&#10;&#93;">&#91;&#8230;&#93;</code> |
## Outputs
| name | description | sensitive |
|---|---|:---:|
| [bigquery-datasets](outputs.tf#L17) | BigQuery datasets. | |
| [demo_commands](outputs.tf#L93) | Demo commands. | |
| [gcs-buckets](outputs.tf#L28) | GCS buckets. | |
| [kms_keys](outputs.tf#L42) | Cloud KMS keys. | |
| [projects](outputs.tf#L47) | GCP projects information. | |
| [vpc_network](outputs.tf#L75) | VPC network. | |
| [vpc_subnet](outputs.tf#L84) | VPC subnetworks. | |
<!-- END TFDOC -->
## TODOs
Features to add in future releases:
- Add support for Column level access on BigQuery
- Add example templates for Data Catalog
- Add example on how to use Cloud Data Loss Prevention
- Add solution to handle Tables, Views, and Authorized Views lifecycle
- Add solution to handle Metadata lifecycle
## To Test/Fix
- Composer requires the "Require OS Login" org policy not to be enforced
- External Shared-VPC

View File

@ -1,251 +0,0 @@
# Data Platform
This module implements an opinionated Data Platform (DP) Architecture that creates and sets up projects and related resources that compose an end-to-end data environment.
The code is intentionally simple, as it's intended to provide a generic initial setup and then allow easy customizations to complete the implementation of the intended design.
The following diagram is a high-level reference of the resources created and managed here:
![Data Platform architecture overview](./images/overview_diagram.png "Data Platform architecture overview")
A demo pipeline is also part of this example: it can be built and run on top of the foundational infrastructure to verify or test the setup quickly.
## Design overview and choices
Despite its simplicity, this stage implements the basics of a design that we've seen working well for various customers.
The approach adapts to different high-level requirements:
- boundaries for each step
- clearly defined actors
- least privilege principle
- rely on service account impersonation
The code in this example doesn't address Organization-level configurations (Organization policy, VPC-SC, centralized logs). We expect those to be managed by automation stages external to this script like those in [FAST](../../../fast).
### Project structure
The DP is designed to rely on several projects, one project per data stage. The stages identified are:
- landing
- load
- data lake
- orchestration
- transformation
- exposure
This separation into projects allows adhering to the least-privilege principle by using project-level roles.
The script will create the following projects:
- **Landing** Used to store temporary data. Data is pushed to Cloud Storage, BigQuery, or Cloud PubSub. Resources are configured with a customizable lifecycle policy.
- **Load** Used to load data from landing to data lake. The load is made with minimal to zero transformation logic (mainly `cast`). Anonymization or tokenization of Personally Identifiable Information (PII) can be implemented here or in the transformation stage, depending on your requirements. The use of [Cloud Dataflow templates](https://cloud.google.com/dataflow/docs/concepts/dataflow-templates) is recommended.
- **Data Lake** Several projects distributed across 3 separate layers, to host progressively processed and refined data:
- **L0 - Raw data** Data stored in the relevant format: structured data stored in BigQuery, unstructured data stored on Cloud Storage with additional metadata stored in BigQuery (for example, pictures stored in Cloud Storage and the corresponding Cloud Vision API analysis results stored in BigQuery).
- **L1 - Cleansed, aggregated and standardized data**
- **L2 - Curated layer**
- **Playground** Temporary tables that Data Analysts may use to perform R&D on data available in other Data Lake layers.
- **Orchestration** Used to host Cloud Composer, which orchestrates all tasks that move data across layers.
- **Transformation** Used to move data between Data Lake layers. We strongly suggest relying on BigQuery Engine to perform the transformations. If BigQuery doesn't have the features needed to perform your transformations, you can use Cloud Dataflow with [Cloud Dataflow templates](https://cloud.google.com/dataflow/docs/concepts/dataflow-templates). This stage can also optionally anonymize or tokenize PII.
- **Exposure** Used to host resources that share processed data with external systems. Depending on the access pattern, data can be presented via Cloud SQL, BigQuery, or Bigtable. For BigQuery data, we strongly suggest relying on [Authorized views](https://cloud.google.com/bigquery/docs/authorized-views).
### Roles
We assign roles on resources at the project level, granting the appropriate role via groups for humans and individual principals for service accounts, according to best practices.
### Service accounts
Service account creation follows the least privilege principle, with each service account performing a single task that requires access to a defined set of resources. For example, the Cloud Dataflow service account only has access to the landing project and the data lake L0 project.
Using service account keys within a data pipeline exposes you to several security risks stemming from credential leaks. This example shows how to leverage service account impersonation to avoid the need to create keys.
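As a minimal sketch of the approach, a Terraform run or pipeline can impersonate a service account instead of loading a key file, assuming the caller holds `roles/iam.serviceAccountTokenCreator` on it; the service account email below is a placeholder:

```hcl
provider "google" {
  # short-lived credentials are minted for this service account at runtime,
  # so no JSON key ever needs to be created or distributed
  impersonate_service_account = "lod-df-0@myco-lod.iam.gserviceaccount.com"
}
```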
### Groups
We use three groups to control access to resources:
- *Data Engineers* They handle and run the Data Hub, with read access to all resources in order to troubleshoot possible issues with pipelines. This team can also impersonate any service account.
- *Data Analysts*. They perform analysis on datasets, with read access to the data lake L2 project and BigQuery read/write access to the playground project.
- *Data Security*. They handle security configurations related to the Data Hub.
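Group names are not hardcoded: they can be remapped to your organization's naming scheme via the `groups` variable, whose documented default (see the Variables table below) is:

```hcl
groups = {
  data-analysts  = "gcp-data-analysts"
  data-engineers = "gcp-data-engineers"
  data-security  = "gcp-data-security"
}
```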
### Virtual Private Cloud (VPC) design
As is often the case in real-world configurations, this example accepts as input an existing [Shared-VPC](https://cloud.google.com/vpc/docs/shared-vpc) via the `network_config` variable.
If the `network_config` variable is not provided, one VPC will be created in each project that supports network resources (load, transformation and orchestration).
### IP ranges and subnetting
To deploy this example with self-managed VPCs you need the following ranges:
- one /24 for the load project VPC subnet used for Cloud Dataflow workers
- one /24 for the transformation VPC subnet used for Cloud Dataflow workers
- one /24 range for the orchestration VPC subnet used for Composer workers
- one /22 and one /24 ranges for the secondary ranges associated with the orchestration VPC subnet
If you are using Shared VPC, you need one subnet with one /22 and one /24 secondary range defined for Composer pods and services.
In both VPC scenarios, you also need these ranges for Composer:
- one /24 for Cloud SQL
- one /28 for the GKE control plane
- one /28 for the web server
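These ranges map onto the `composer_config` variable; its documented default (see the Variables table below), reproduced here for convenience, already satisfies the requirements above and can be adjusted to fit your own addressing plan:

```hcl
composer_config = {
  ip_range_cloudsql   = "10.20.10.0/24"  # Cloud SQL
  ip_range_gke_master = "10.20.11.0/28"  # GKE control plane
  ip_range_web_server = "10.20.11.16/28" # web server
  policy_boolean      = null
  region              = "europe-west1"
  secondary_ip_range = {
    pods     = "10.10.8.0/22"
    services = "10.10.12.0/24"
  }
}
```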
### Resource naming conventions
Resources in the script use the following acronyms:
- `lnd` for `landing`
- `lod` for `load`
- `orc` for `orchestration`
- `trf` for `transformation`
- `dtl` for `Data Lake`
- `cmn` for `common`
- `plg` for `playground`
- a two-letter acronym for GCP products, for example `bq` for `BigQuery`, `df` for `Cloud Dataflow`, ...
Resources follow the naming convention described below.
- `prefix-layer` for projects
- `prefix-layer[2]-gcp-product[2]-counter` for services and service accounts
### Encryption
We suggest a centralized approach to key management, where Security is the only team that can access encryption material, and keyrings and keys are managed in a project external to the DP.
![Centralized Cloud Key Management high-level diagram](./images/kms_diagram.png "Centralized Cloud Key Management high-level diagram")
To configure the use of Cloud KMS on resources, you have to specify the key id in the `service_encryption_keys` variable. Key locations should match resource locations. Example:

```hcl
service_encryption_keys = {
  bq       = "KEY_URL_MULTIREGIONAL"
  composer = "KEY_URL_REGIONAL"
  dataflow = "KEY_URL_REGIONAL"
  storage  = "KEY_URL_MULTIREGIONAL"
  pubsub   = "KEY_URL_MULTIREGIONAL"
}
```
This step is optional and depends on customer policies and security best practices.
## Data Anonymization
We suggest using Cloud Data Loss Prevention to identify/mask/tokenize your confidential data.
While implementing a Data Loss Prevention strategy is out of scope for this example, we enable the service in two different projects so that [Cloud Data Loss Prevention templates](https://cloud.google.com/dlp/docs/concepts-templates) can be configured in one of two ways:
- during the ingestion phase, from Dataflow
- during the transformation phase, from [BigQuery](https://cloud.google.com/bigquery/docs/scan-with-dlp) or [Cloud Dataflow](https://cloud.google.com/architecture/running-automated-dataflow-pipeline-de-identify-pii-dataset)
Cloud Data Loss Prevention resources and templates should be stored in the security project:
![Centralized Cloud Data Loss Prevention high-level diagram](./images/dlp_diagram.png "Centralized Cloud Data Loss Prevention high-level diagram")
## How to run this script
To deploy this example on your GCP organization, you will need
- a folder or organization where new projects will be created
- a billing account that will be associated with the new projects
The DP is meant to be executed by a Service Account (or a regular user) having this minimal set of permissions:
- **Org level**:
- `"compute.organizations.enableXpnResource"`
- `"compute.organizations.disableXpnResource"`
- `"compute.subnetworks.setIamPolicy"`
- **Folder level**:
- `"roles/logging.admin"`
- `"roles/owner"`
- `"roles/resourcemanager.folderAdmin"`
- `"roles/resourcemanager.projectCreator"`
- **Cloud Key Management Keys** (if Cloud Key Management keys are configured):
- `"roles/cloudkms.admin"` or Permissions: `cloudkms.cryptoKeys.getIamPolicy`, `cloudkms.cryptoKeys.list`, `cloudkms.cryptoKeys.setIamPolicy`
- **On the host project** for the Shared VPC/s
- `"roles/browser"`
- `"roles/compute.viewer"`
- `"roles/dns.admin"`
## Variable configuration
There are three sets of variables you will need to fill in:
```hcl
prefix = "myco"
project_create = {
parent = "folders/123456789012"
billing_account_id = "111111-222222-333333"
}
organization = {
domain = "domain.com"
}
```
For more details, check the variables in [`variables.tf`](./variables.tf) and update them according to the desired configuration.
## Customizations
### Create Cloud Key Management keys as part of the DP
To create Cloud Key Management keys in the DP you can uncomment the Cloud Key Management resources configured in the [`06-common.tf`](./06-common.tf) file and update the Cloud Key Management key references in `local.service_encryption_keys.*` to point to the local resources created.
### Assign roles at BQ Dataset level
To handle multiple groups of `data-analysts` accessing the same Data Lake layer projects but only the datasets belonging to a specific group, you may want to assign roles at the BigQuery dataset level instead of at the project level.
To do this, you need to remove the project-level IAM binding for the `data-analysts` group and grant roles at the BigQuery dataset level using the `iam` variable on the `bigquery-dataset` modules.
## Demo pipeline
The application layer is out of scope of this script, but a Cloud Composer DAG is provided as a demo to move data from the `landing` area to the `DataLake L2` dataset.
Just follow the commands you find in the `demo_commands` Terraform output, go to the Cloud Composer Airflow UI, and run the `data_pipeline_dag`.
Description of commands:
- 01: copy sample data to the `landing` Cloud Storage bucket impersonating the `load` service account.
- 02: copy the sample data structure definition to the `orchestration` Cloud Storage bucket impersonating the `orchestration` service account.
- 03: copy the Cloud Composer DAG to the Cloud Composer Storage bucket impersonating the `orchestration` service account.
- 04: open the Cloud Composer Airflow UI and run the imported DAG.
- 05: run the BigQuery query to see the results.
<!-- BEGIN TFDOC -->
## Variables
| name | description | type | required | default |
|---|---|:---:|:---:|:---:|
| [organization](variables.tf#L88) | Organization details. | <code title="object&#40;&#123;&#10; domain &#61; string&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | ✓ | |
| [prefix](variables.tf#L95) | Unique prefix used for resource names. Not used for projects if 'project_create' is null. | <code>string</code> | ✓ | |
| [composer_config](variables.tf#L17) | | <code title="object&#40;&#123;&#10; ip_range_cloudsql &#61; string&#10; ip_range_gke_master &#61; string&#10; ip_range_web_server &#61; string&#10; policy_boolean &#61; map&#40;bool&#41;&#10; region &#61; string&#10; secondary_ip_range &#61; object&#40;&#123;&#10; pods &#61; string&#10; services &#61; string&#10; &#125;&#41;&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | | <code title="&#123;&#10; ip_range_cloudsql &#61; &#34;10.20.10.0&#47;24&#34;&#10; ip_range_gke_master &#61; &#34;10.20.11.0&#47;28&#34;&#10; ip_range_web_server &#61; &#34;10.20.11.16&#47;28&#34;&#10; policy_boolean &#61; null&#10; region &#61; &#34;europe-west1&#34;&#10; secondary_ip_range &#61; &#123;&#10; pods &#61; &#34;10.10.8.0&#47;22&#34;&#10; services &#61; &#34;10.10.12.0&#47;24&#34;&#10; &#125;&#10;&#125;">&#123;&#8230;&#125;</code> |
| [data_force_destroy](variables.tf#L42) | Flag to set 'force_destroy' on data services like BigQuery or Cloud Storage. | <code>bool</code> | | <code>false</code> |
| [groups](variables.tf#L48) | Groups. | <code>map&#40;string&#41;</code> | | <code title="&#123;&#10; data-analysts &#61; &#34;gcp-data-analysts&#34;&#10; data-engineers &#61; &#34;gcp-data-engineers&#34;&#10; data-security &#61; &#34;gcp-data-security&#34;&#10;&#125;">&#123;&#8230;&#125;</code> |
| [location_config](variables.tf#L148) | Locations where resources will be deployed. Map to configure region and multiregion specs. | <code title="object&#40;&#123;&#10; region &#61; string&#10; multi_region &#61; string&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | | <code title="&#123;&#10; region &#61; &#34;europe-west1&#34;&#10; multi_region &#61; &#34;eu&#34;&#10;&#125;">&#123;&#8230;&#125;</code> |
| [network_config](variables.tf#L58) | Network configurations to use. Specify a shared VPC to use, if null networks will be created in projects. | <code title="object&#40;&#123;&#10; enable_cloud_nat &#61; bool&#10; host_project &#61; string&#10; network &#61; string&#10; vpc_subnet_range &#61; object&#40;&#123;&#10; load &#61; string&#10; transformation &#61; string&#10; orchestration &#61; string&#10; &#125;&#41;&#10; vpc_subnet_self_link &#61; object&#40;&#123;&#10; load &#61; string&#10; transformation &#61; string&#10; orchestration &#61; string&#10; &#125;&#41;&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | | <code title="&#123;&#10; enable_cloud_nat &#61; false&#10; host_project &#61; null&#10; network &#61; null&#10; vpc_subnet_range &#61; &#123;&#10; load &#61; &#34;10.10.0.0&#47;24&#34;&#10; transformation &#61; &#34;10.10.0.0&#47;24&#34;&#10; orchestration &#61; &#34;10.10.0.0&#47;24&#34;&#10; &#125;&#10; vpc_subnet_self_link &#61; null&#10;&#125;">&#123;&#8230;&#125;</code> |
| [project_create](variables.tf#L100) | Provide values if project creation is needed, uses existing project if null. Parent is in 'folders/nnn' or 'organizations/nnn' format. | <code title="object&#40;&#123;&#10; billing_account_id &#61; string&#10; parent &#61; string&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | | <code>null</code> |
| [project_id](variables.tf#L109) | Project id, references existing project if `project_create` is null. | <code title="object&#40;&#123;&#10; landing &#61; string&#10; load &#61; string&#10; orchestration &#61; string&#10; trasformation &#61; string&#10; datalake-l0 &#61; string&#10; datalake-l1 &#61; string&#10; datalake-l2 &#61; string&#10; datalake-playground &#61; string&#10; common &#61; string&#10; exposure &#61; string&#10;&#125;&#41;">object&#40;&#123;&#8230;&#125;&#41;</code> | | <code title="&#123;&#10; landing &#61; &#34;lnd&#34;&#10; load &#61; &#34;lod&#34;&#10; orchestration &#61; &#34;orc&#34;&#10; trasformation &#61; &#34;trf&#34;&#10; datalake-l0 &#61; &#34;dtl-0&#34;&#10; datalake-l1 &#61; &#34;dtl-1&#34;&#10; datalake-l2 &#61; &#34;dtl-2&#34;&#10; datalake-playground &#61; &#34;dtl-plg&#34;&#10; common &#61; &#34;cmn&#34;&#10; exposure &#61; &#34;exp&#34;&#10;&#125;">&#123;&#8230;&#125;</code> |
| [project_services](variables.tf#L137) | List of core services enabled on all projects. | <code>list&#40;string&#41;</code> | | <code title="&#91;&#10; &#34;cloudresourcemanager.googleapis.com&#34;,&#10; &#34;iam.googleapis.com&#34;,&#10; &#34;serviceusage.googleapis.com&#34;,&#10; &#34;stackdriver.googleapis.com&#34;&#10;&#93;">&#91;&#8230;&#93;</code> |
## Outputs
| name | description | sensitive |
|---|---|:---:|
| [bigquery-datasets](outputs.tf#L17) | BigQuery datasets. | |
| [demo_commands](outputs.tf#L93) | Demo commands. | |
| [gcs-buckets](outputs.tf#L28) | GCS buckets. | |
| [kms_keys](outputs.tf#L42) | Cloud KMS keys. | |
| [projects](outputs.tf#L47) | GCP projects information. | |
| [vpc_network](outputs.tf#L75) | VPC network. | |
| [vpc_subnet](outputs.tf#L84) | VPC subnetworks. | |
<!-- END TFDOC -->
## TODOs
Features to add in future releases:
- Add support for Column level access on BigQuery
- Add example templates for Data Catalog
- Add example on how to use Cloud Data Loss Prevention
- Add solution to handle Tables, Views, and Authorized Views lifecycle
- Add solution to handle Metadata lifecycle
## To Test/Fix
- Composer requires the "Require OS Login" org policy not to be enforced
- External Shared-VPC