Merge pull request #242 from terraform-google-modules/add-data-platform-foundations

Add data platform foundations
This commit is contained in:
lcaggio 2021-06-15 17:39:53 +02:00 committed by GitHub
commit 381b532c0c
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
22 changed files with 1204 additions and 0 deletions

View File

@ -14,3 +14,11 @@ They are meant to be used as minimal but complete starting points to create actu
### Cloud Storage to Bigquery with Cloud Dataflow
<a href="./gcs-to-bq-with-dataflow/" title="Cloud Storage to Bigquery with Cloud Dataflow"><img src="./gcs-to-bq-with-dataflow/diagram.png" align="left" width="280px"></a> This [example](./gcs-to-bq-with-dataflow/) implements [Cloud Storage](https://cloud.google.com/storage) to Bigquery data import using Cloud Dataflow.
All resources use CMEK hosted in Cloud KMS running in a centralized project. The example shows the basic resources and permissions for the typical use case to read, transform and import data from Cloud Storage to Bigquery.
<br clear="left">
### Data Platform Foundations
<a href="./data-platform-foundations/" title="Data Platform Foundations"><img src="./data-platform-foundations/02-resources/diagram.png" align="left" width="280px"></a>
This [example](./data-platform-foundations/) implements a robust and flexible Data Foundation on GCP that provides opinionated defaults, allowing customers to build and scale out additional data pipelines quickly and reliably.
<br clear="left">

View File

@ -0,0 +1,53 @@
# Data Platform Foundations - Environment (Step 1)
This is the first step needed to deploy Data Platform Foundations: it creates the projects and service accounts. Please refer to the [top-level Data Platform README](../README.md) for prerequisites.
The projects that will be created are:
- Common services
- Landing
- Orchestration & Transformation
- DWH
- Datamart
A main service account (`data-platform-main` by default, see the `service_account_names` variable) will be created under the common services project, and it will be granted editor permissions on all the projects in scope.
This is a high-level diagram of the created resources:
![Environment - Phase 1](./diagram.png "High-level Environment diagram")
## Running the example
To create the infrastructure:
- specify your variables in a `terraform.tfvars` file:
```tfm
billing_account_id = "1234-1234-1234"
root_node          = "folders/12345678"
```
- make sure you have the right authentication setup (application default credentials, or a service account key)
- **The output of this stage contains the values for the resources stage**
- run `terraform init` and `terraform apply`
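Putting it together, a minimal end-to-end sequence could look like this (a sketch, assuming application default credentials and the `terraform.tfvars` above):
```bash
# Authenticate (or point GOOGLE_APPLICATION_CREDENTIALS at a key file).
gcloud auth application-default login

# Create the projects and the main service account.
terraform init
terraform apply

# Print the outputs consumed by the resources stage (step 2).
terraform output project_ids
terraform output service_account
```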
Once done testing, you can clean up resources by running `terraform destroy`.
<!-- BEGIN TFDOC -->
## Variables
| name | description | type | required | default |
|---|---|:---: |:---:|:---:|
| billing_account_id | Billing account id. | <code title="">string</code> | ✓ | |
| root_node | Parent folder or organization in 'folders/folder_id' or 'organizations/org_id' format. | <code title="">string</code> | ✓ | |
| *prefix* | Prefix used to generate project id and name. | <code title="">string</code> | | <code title="">null</code> |
| *project_names* | Override this variable if you need non-standard names. | <code title="object&#40;&#123;&#10;datamart &#61; string&#10;dwh &#61; string&#10;landing &#61; string&#10;services &#61; string&#10;transformation &#61; string&#10;&#125;&#41;">object({...})</code> | | <code title="&#123;&#10;datamart &#61; &#34;datamart&#34;&#10;dwh &#61; &#34;datawh&#34;&#10;landing &#61; &#34;landing&#34;&#10;services &#61; &#34;services&#34;&#10;transformation &#61; &#34;transformation&#34;&#10;&#125;">...</code> |
| *service_account_names* | Override this variable if you need non-standard names. | <code title="object&#40;&#123;&#10;main &#61; string&#10;&#125;&#41;">object({...})</code> | | <code title="&#123;&#10;main &#61; &#34;data-platform-main&#34;&#10;&#125;">...</code> |
## Outputs
| name | description | sensitive |
|---|---|:---:|
| project_ids | Project ids for created projects. | |
| service_account | Main service account. | |
<!-- END TFDOC -->

Binary file not shown.

After

Width:  |  Height:  |  Size: 275 KiB

View File

@ -0,0 +1,115 @@
/**
* Copyright 2020 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
###############################################################################
# projects #
###############################################################################
module "project-datamart" {
source = "../../../modules/project"
parent = var.root_node
billing_account = var.billing_account_id
prefix = var.prefix
name = var.project_names.datamart
services = [
"bigtable.googleapis.com",
"bigtableadmin.googleapis.com",
"bigquery.googleapis.com",
"bigquerystorage.googleapis.com",
"bigqueryreservation.googleapis.com",
"storage-component.googleapis.com",
]
iam = {
"roles/editor" = [module.sa-services-main.iam_email]
}
}
module "project-dwh" {
source = "../../../modules/project"
parent = var.root_node
billing_account = var.billing_account_id
prefix = var.prefix
name = var.project_names.dwh
services = [
"bigquery.googleapis.com",
"bigquerystorage.googleapis.com",
"bigqueryreservation.googleapis.com",
"storage-component.googleapis.com",
]
iam = {
"roles/editor" = [module.sa-services-main.iam_email]
}
}
module "project-landing" {
source = "../../../modules/project"
parent = var.root_node
billing_account = var.billing_account_id
prefix = var.prefix
name = var.project_names.landing
services = [
"pubsub.googleapis.com",
"storage-component.googleapis.com",
]
iam = {
"roles/editor" = [module.sa-services-main.iam_email]
}
}
module "project-services" {
source = "../../../modules/project"
parent = var.root_node
billing_account = var.billing_account_id
prefix = var.prefix
name = var.project_names.services
services = [
"storage-component.googleapis.com",
"sourcerepo.googleapis.com",
"stackdriver.googleapis.com",
"cloudasset.googleapis.com",
]
iam = {
"roles/editor" = [module.sa-services-main.iam_email]
}
}
module "project-transformation" {
source = "../../../modules/project"
parent = var.root_node
billing_account = var.billing_account_id
prefix = var.prefix
name = var.project_names.transformation
services = [
"cloudbuild.googleapis.com",
"compute.googleapis.com",
"dataflow.googleapis.com",
"servicenetworking.googleapis.com",
"storage-component.googleapis.com",
]
iam = {
"roles/editor" = [module.sa-services-main.iam_email]
}
}
###############################################################################
# service accounts #
###############################################################################
module "sa-services-main" {
source = "../../../modules/iam-service-account"
project_id = module.project-services.project_id
name = var.service_account_names.main
}

View File

@ -0,0 +1,31 @@
/**
* Copyright 2020 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
output "project_ids" {
description = "Project ids for created projects."
value = {
datamart = module.project-datamart.project_id
dwh = module.project-dwh.project_id
landing = module.project-landing.project_id
services = module.project-services.project_id
transformation = module.project-transformation.project_id
}
}
output "service_account" {
description = "Main service account."
value = module.sa-services-main.email
}

View File

@ -0,0 +1,57 @@
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
variable "billing_account_id" {
description = "Billing account id."
type = string
}
variable "prefix" {
description = "Prefix used to generate project id and name."
type = string
default = null
}
variable "project_names" {
description = "Override this variable if you need non-standard names."
type = object({
datamart = string
dwh = string
landing = string
services = string
transformation = string
})
default = {
datamart = "datamart"
dwh = "datawh"
landing = "landing"
services = "services"
transformation = "transformation"
}
}
variable "root_node" {
description = "Parent folder or organization in 'folders/folder_id' or 'organizations/org_id' format."
type = string
}
variable "service_account_names" {
description = "Override this variable if you need non-standard names."
type = object({
main = string
})
default = {
main = "data-platform-main"
}
}

View File

@ -0,0 +1,17 @@
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
terraform {
required_version = ">= 0.13"
}

View File

@ -0,0 +1,78 @@
# Data Platform Foundations - Resources (Step 2)
This is the second step needed to deploy Data Platform Foundations: it creates the resources needed to store and process data in the projects created in the [previous step](./../environment/). Please refer to the [top-level README](../README.md) for prerequisites and how to run the first step.
![Data Foundation - Phase 2](./diagram.png "High-level diagram")
The resources that will be created in each project are:
- Common
- Landing
- [x] GCS
- [x] Pub/Sub
- Orchestration & Transformation
- [x] Dataflow
- DWH
- [x] Bigquery (L0/1/2)
- [x] GCS
- Datamart
- [x] Bigquery (views/table)
- [x] GCS
- [ ] BigTable
## Running the example
In the previous step, we created the environment (projects and service account) which we are going to use in this step.
To create the resources:
- Specify your variables in a `terraform.tfvars`, pasting in the **project_ids** output of the environment step:
```tfm
project_ids = {
datamart = "datamart-project_id"
dwh = "dwh-project_id"
landing = "landing-project_id"
services = "services-project_id"
transformation = "transformation-project_id"
}
```
- Get a key for the service account created in the environment stage:
  - Go to the services project
  - Open the IAM page and go to the service accounts section
  - Create a new key for the service account created in the previous step (**service_account**)
  - Download the JSON key into the current folder
- make sure you have the right authentication setup: `export GOOGLE_APPLICATION_CREDENTIALS=PATH_TO_SERVICE_ACCOUNT_KEY.json`
- run `terraform init` and `terraform apply`
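Putting these steps together, the sequence could look like this (a sketch; the service account email is illustrative and should be taken from the **service_account** output of the environment stage):
```bash
# Create and download a key for the main service account from step 1
# (replace the email with the service_account output value).
gcloud iam service-accounts keys create data-platform-main.json \
  --iam-account=data-platform-main@your-services-project-id.iam.gserviceaccount.com

# Authenticate with the downloaded key.
export GOOGLE_APPLICATION_CREDENTIALS=$PWD/data-platform-main.json

# Create the resources.
terraform init
terraform apply
```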
Once done testing, you can clean up resources by running `terraform destroy`.
<!-- BEGIN TFDOC -->
## Variables
| name | description | type | required | default |
|---|---|:---: |:---:|:---:|
| project_ids | Project IDs. | <code title="object&#40;&#123;&#10;datamart &#61; string&#10;dwh &#61; string&#10;landing &#61; string&#10;services &#61; string&#10;transformation &#61; string&#10;&#125;&#41;">object({...})</code> | ✓ | |
| *datamart_bq_datasets* | Datamart Bigquery datasets | <code title="map&#40;object&#40;&#123;&#10;iam &#61; map&#40;list&#40;string&#41;&#41;&#10;location &#61; string&#10;&#125;&#41;&#41;">map(object({...}))</code> | | <code title="&#123;&#10;bq_datamart_dataset &#61; &#123;&#10;location &#61; &#34;EU&#34;&#10;iam &#61; &#123;&#10;&#125;&#10;&#125;&#10;&#125;">...</code> |
| *dwh_bq_datasets* | DWH Bigquery datasets | <code title="map&#40;object&#40;&#123;&#10;location &#61; string&#10;iam &#61; map&#40;list&#40;string&#41;&#41;&#10;&#125;&#41;&#41;">map(object({...}))</code> | | <code title="&#123;&#10;bq_raw_dataset &#61; &#123;&#10;iam &#61; &#123;&#125;&#10;location &#61; &#34;EU&#34;&#10;&#125;&#10;&#125;">...</code> |
| *landing_buckets* | List of landing buckets to create | <code title="map&#40;object&#40;&#123;&#10;location &#61; string&#10;name &#61; string&#10;&#125;&#41;&#41;">map(object({...}))</code> | | <code title="&#123;&#10;raw-data &#61; &#123;&#10;location &#61; &#34;EU&#34;&#10;name &#61; &#34;raw-data&#34;&#10;&#125;&#10;data-schema &#61; &#123;&#10;location &#61; &#34;EU&#34;&#10;name &#61; &#34;data-schema&#34;&#10;&#125;&#10;&#125;">...</code> |
| *landing_pubsub* | List of landing pubsub topics and subscriptions to create | <code title="map&#40;map&#40;object&#40;&#123;&#10;iam &#61; map&#40;list&#40;string&#41;&#41;&#10;labels &#61; map&#40;string&#41;&#10;options &#61; object&#40;&#123;&#10;ack_deadline_seconds &#61; number&#10;message_retention_duration &#61; number&#10;retain_acked_messages &#61; bool&#10;expiration_policy_ttl &#61; number&#10;&#125;&#41;&#10;&#125;&#41;&#41;&#41;">map(map(object({...})))</code> | | <code title="&#123;&#10;landing-1 &#61; &#123;&#10;sub1 &#61; &#123;&#10;iam &#61; &#123;&#10;&#125;&#10;labels &#61; &#123;&#125;&#10;options &#61; null&#10;&#125;&#10;sub2 &#61; &#123;&#10;iam &#61; &#123;&#125;&#10;labels &#61; &#123;&#125;,&#10;options &#61; null&#10;&#125;,&#10;&#125;&#10;&#125;">...</code> |
| *landing_service_account* | landing service accounts list. | <code title="">string</code> | | <code title="">sa-landing</code> |
| *service_account_names* | Project service accounts list. | <code title="object&#40;&#123;&#10;datamart &#61; string&#10;dwh &#61; string&#10;landing &#61; string&#10;services &#61; string&#10;transformation &#61; string&#10;&#125;&#41;">object({...})</code> | | <code title="&#123;&#10;datamart &#61; &#34;sa-datamart&#34;&#10;dwh &#61; &#34;sa-datawh&#34;&#10;landing &#61; &#34;sa-landing&#34;&#10;services &#61; &#34;sa-services&#34;&#10;transformation &#61; &#34;sa-transformation&#34;&#10;&#125;">...</code> |
| *transformation_buckets* | List of transformation buckets to create | <code title="map&#40;object&#40;&#123;&#10;location &#61; string&#10;name &#61; string&#10;&#125;&#41;&#41;">map(object({...}))</code> | | <code title="&#123;&#10;temp &#61; &#123;&#10;location &#61; &#34;EU&#34;&#10;name &#61; &#34;temp&#34;&#10;&#125;,&#10;templates &#61; &#123;&#10;location &#61; &#34;EU&#34;&#10;name &#61; &#34;templates&#34;&#10;&#125;,&#10;&#125;">...</code> |
| *transformation_subnets* | List of subnets to create in the transformation Project. | <code title="list&#40;object&#40;&#123;&#10;ip_cidr_range &#61; string&#10;name &#61; string&#10;region &#61; string&#10;secondary_ip_range &#61; map&#40;string&#41;&#10;&#125;&#41;&#41;">list(object({...}))</code> | | <code title="&#91;&#10;&#123;&#10;ip_cidr_range &#61; &#34;10.1.0.0&#47;20&#34;&#10;name &#61; &#34;transformation-subnet&#34;&#10;region &#61; &#34;europe-west3&#34;&#10;secondary_ip_range &#61; &#123;&#125;&#10;&#125;,&#10;&#93;">...</code> |
| *transformation_vpc_name* | Name of the VPC created in the transformation Project. | <code title="">string</code> | | <code title="">transformation-vpc</code> |
## Outputs
| name | description | sensitive |
|---|---|:---:|
| datamart-datasets | List of bigquery datasets created for the datamart project. | |
| dwh-datasets | List of bigquery datasets created for the dwh project. | |
| landing-buckets | List of buckets created for the landing project. | |
| landing-pubsub | List of pubsub topics and subscriptions created for the landing project. | |
| transformation-buckets | List of buckets created for the transformation project. | |
| transformation-vpc | Transformation VPC details | |
<!-- END TFDOC -->

Binary file not shown.

After

Width:  |  Height:  |  Size: 470 KiB

View File

@ -0,0 +1,163 @@
/**
* Copyright 2020 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
###############################################################################
# IAM #
###############################################################################
module "datamart-sa" {
source = "../../../modules/iam-service-account"
project_id = var.project_ids.datamart
name = var.service_account_names.datamart
iam_project_roles = {
"${var.project_ids.datamart}" = ["roles/editor"]
}
}
module "dwh-sa" {
source = "../../../modules/iam-service-account"
project_id = var.project_ids.dwh
name = var.service_account_names.dwh
}
module "landing-sa" {
source = "../../../modules/iam-service-account"
project_id = var.project_ids.landing
name = var.service_account_names.landing
iam_project_roles = {
"${var.project_ids.landing}" = ["roles/pubsub.publisher"]
}
}
module "services-sa" {
source = "../../../modules/iam-service-account"
project_id = var.project_ids.services
name = var.service_account_names.services
iam_project_roles = {
"${var.project_ids.services}" = ["roles/editor"]
}
}
module "transformation-sa" {
source = "../../../modules/iam-service-account"
project_id = var.project_ids.transformation
name = var.service_account_names.transformation
iam_project_roles = {
"${var.project_ids.transformation}" = [
"roles/logging.logWriter",
"roles/monitoring.metricWriter",
"roles/dataflow.admin",
"roles/iam.serviceAccountUser",
"roles/bigquery.dataOwner",
"roles/bigquery.jobUser",
"roles/dataflow.worker",
"roles/bigquery.metadataViewer",
"roles/storage.objectViewer",
]
}
}
###############################################################################
# GCS #
###############################################################################
module "landing-buckets" {
source = "../../../modules/gcs"
for_each = var.landing_buckets
project_id = var.project_ids.landing
prefix = var.project_ids.landing
name = each.value.name
location = each.value.location
iam = {
"roles/storage.objectCreator" = [module.landing-sa.iam_email]
"roles/storage.admin" = [module.transformation-sa.iam_email]
}
}
module "transformation-buckets" {
source = "../../../modules/gcs"
for_each = var.transformation_buckets
project_id = var.project_ids.transformation
prefix = var.project_ids.transformation
name = each.value.name
location = each.value.location
iam = {
"roles/storage.admin" = [module.transformation-sa.iam_email]
}
}
###############################################################################
# Bigquery #
###############################################################################
module "datamart-bq" {
source = "../../../modules/bigquery-dataset"
for_each = var.datamart_bq_datasets
project_id = var.project_ids.datamart
id = each.key
location = each.value.location
iam = {
for k, v in each.value.iam : k => (
k == "roles/bigquery.dataOwner"
? concat(v, [module.datamart-sa.iam_email])
: v
)
}
}
module "dwh-bq" {
source = "../../../modules/bigquery-dataset"
for_each = var.dwh_bq_datasets
project_id = var.project_ids.dwh
id = each.key
location = each.value.location
iam = {
for k, v in each.value.iam : k => (
k == "roles/bigquery.dataOwner"
? concat(v, [module.dwh-sa.iam_email])
: v
)
}
}
###############################################################################
# Network #
###############################################################################
module "vpc-transformation" {
source = "../../../modules/net-vpc"
project_id = var.project_ids.transformation
name = var.transformation_vpc_name
subnets = var.transformation_subnets
}
###############################################################################
# Pub/Sub #
###############################################################################
module "landing-pubsub" {
source = "../../../modules/pubsub"
for_each = var.landing_pubsub
project_id = var.project_ids.landing
name = each.key
subscriptions = {
for k, v in each.value : k => { labels = v.labels, options = v.options }
}
subscription_iam = {
for k, v in each.value : k => merge(v.iam, {
"roles/pubsub.subscriber" = [module.transformation-sa.iam_email]
})
}
}

View File

@ -0,0 +1,60 @@
/**
* Copyright 2020 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
output "datamart-datasets" {
description = "List of bigquery datasets created for the datamart project."
value = [
for k, datasets in module.datamart-bq : datasets.dataset_id
]
}
output "dwh-datasets" {
description = "List of bigquery datasets created for the dwh project."
value = [for k, datasets in module.dwh-bq : datasets.dataset_id]
}
output "landing-buckets" {
description = "List of buckets created for the landing project."
value = [for k, bucket in module.landing-buckets : bucket.name]
}
output "landing-pubsub" {
description = "List of pubsub topics and subscriptions created for the landing project."
value = {
for t in module.landing-pubsub : t.topic.name => {
id = t.topic.id
subscriptions = { for s in t.subscriptions : s.name => s.id }
}
}
}
output "transformation-buckets" {
description = "List of buckets created for the transformation project."
value = [for k, bucket in module.transformation-buckets : bucket.name]
}
output "transformation-vpc" {
description = "Transformation VPC details"
value = {
name = module.vpc-transformation.name
subnets = {
for k, s in module.vpc-transformation.subnets : k => {
ip_cidr_range = s.ip_cidr_range
region = s.region
}
}
}
}

View File

@ -0,0 +1,171 @@
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
variable "datamart_bq_datasets" {
description = "Datamart Bigquery datasets"
type = map(object({
iam = map(list(string))
location = string
}))
default = {
bq_datamart_dataset = {
location = "EU"
iam = {
# "roles/bigquery.dataOwner" = []
# "roles/bigquery.dataEditor" = []
# "roles/bigquery.dataViewer" = []
}
}
}
}
variable "dwh_bq_datasets" {
description = "DWH Bigquery datasets"
type = map(object({
location = string
iam = map(list(string))
}))
default = {
bq_raw_dataset = {
iam = {}
location = "EU"
}
}
}
variable "landing_buckets" {
description = "List of landing buckets to create"
type = map(object({
location = string
name = string
}))
default = {
raw-data = {
location = "EU"
name = "raw-data"
}
data-schema = {
location = "EU"
name = "data-schema"
}
}
}
variable "landing_pubsub" {
description = "List of landing pubsub topics and subscriptions to create"
type = map(map(object({
iam = map(list(string))
labels = map(string)
options = object({
ack_deadline_seconds = number
message_retention_duration = number
retain_acked_messages = bool
expiration_policy_ttl = number
})
})))
default = {
landing-1 = {
sub1 = {
iam = {
# "roles/pubsub.subscriber" = []
}
labels = {}
options = null
}
sub2 = {
iam = {}
labels = {},
options = null
},
}
}
}
variable "landing_service_account" {
description = "landing service accounts list."
type = string
default = "sa-landing"
}
variable "project_ids" {
description = "Project IDs."
type = object({
datamart = string
dwh = string
landing = string
services = string
transformation = string
})
}
variable "service_account_names" {
description = "Project service accounts list."
type = object({
datamart = string
dwh = string
landing = string
services = string
transformation = string
})
default = {
datamart = "sa-datamart"
dwh = "sa-datawh"
landing = "sa-landing"
services = "sa-services"
transformation = "sa-transformation"
}
}
variable "transformation_buckets" {
description = "List of transformation buckets to create"
type = map(object({
location = string
name = string
}))
default = {
temp = {
location = "EU"
name = "temp"
},
templates = {
location = "EU"
name = "templates"
},
}
}
variable "transformation_subnets" {
description = "List of subnets to create in the transformation Project."
type = list(object({
ip_cidr_range = string
name = string
region = string
secondary_ip_range = map(string)
}))
default = [
{
ip_cidr_range = "10.1.0.0/20"
name = "transformation-subnet"
region = "europe-west3"
secondary_ip_range = {}
},
]
}
variable "transformation_vpc_name" {
description = "Name of the VPC created in the transformation Project."
type = string
default = "transformation-vpc"
}

View File

@ -0,0 +1,17 @@
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
terraform {
required_version = ">= 0.13"
}

View File

@ -0,0 +1,8 @@
# Manual pipeline Example
Once you have deployed the projects ([step 1](../infra/tf-phase1/README.md)) and resources ([step 2](../infra/tf-phase2/README.md)), you can use them to run your data pipelines.
Here we will demo two pipelines:
* [GCS to Bigquery](./gcs_to_bigquery.md)
* [PubSub to Bigquery](./pubsub_to_bigquery.md)

View File

@ -0,0 +1,151 @@
# Manual pipeline Example: GCS to Bigquery
In this example we will produce person records in the following CSV format:
```bash
Lorenzo,Caggioni,1617898199
```
A Dataflow batch pipeline will then read those records and import them into a Bigquery table in the DWH project.
[TODO] An authorized view will be created in the datamart project to expose the table.
[TODO] Remove the hardcoded 'lcaggio' values and replace them with an environment variable.
[TODO] Further automation is expected in the future.
Create and download keys for the service accounts you created.
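The commands are the same as in the [PubSub to Bigquery](./pubsub_to_bigquery.md) example, shown here for convenience (assuming the same example project ids used throughout this walkthrough):
```bash
gcloud iam service-accounts keys create sa-landing.json --iam-account=sa-landing@landing-lc01.iam.gserviceaccount.com
gcloud iam service-accounts keys create sa-transformation.json --iam-account=sa-transformation@transformation-lc01.iam.gserviceaccount.com
gcloud iam service-accounts keys create sa-dwh.json --iam-account=sa-dwh@dwh-lc01.iam.gserviceaccount.com
```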
## Create BQ table
These steps should be run as the DWH service account:
```bash
gcloud auth activate-service-account sa-dwh@dwh-lc01.iam.gserviceaccount.com --key-file=sa-dwh.json --project=dwh-lc01
```
Then run the following command to create the table:
```bash
bq mk \
-t \
--description "This is a Test Person table" \
dwh-lc01:bq_raw_dataset.person \
name:STRING,surname:STRING,timestamp:TIMESTAMP
```
## Produce CSV data file, JSON schema file and UDF JS file
These steps should be run as the landing service account:
```bash
gcloud auth activate-service-account sa-landing@landing-lc01.iam.gserviceaccount.com --key-file=sa-landing.json --project=landing-lc01
```
Let's now create a series of records to import:
```bash
for i in {0..10}
do
echo "Lorenzo,Caggioni,$(date +%s)" >> person.csv
done
```
and copy the file to the GCS bucket:
```bash
gsutil cp person.csv gs://landing-lc01-eu-raw-data
```
Let's create the data JSON schema:
```bash
cat <<'EOF' >> person_schema.json
{
"BigQuery Schema": [
{
"name": "name",
"type": "STRING"
},
{
"name": "surname",
"type": "STRING"
},
{
"name": "timestamp",
"type": "TIMESTAMP"
}
]
}
EOF
```
and copy the file to the GCS bucket:
```bash
gsutil cp person_schema.json gs://landing-lc01-eu-data-schema
```
Let's create the UDF function used to transform the record data:
```bash
cat <<'EOF' >> person_udf.js
function transform(line) {
var values = line.split(',');
var obj = new Object();
obj.name = values[0];
obj.surname = values[1];
obj.timestamp = values[2];
var jsonString = JSON.stringify(obj);
return jsonString;
}
EOF
```
and copy the file to the GCS bucket:
```bash
gsutil cp person_udf.js gs://landing-lc01-eu-data-schema
```
If you want to check the files copied to GCS, you can use the transformation service account:
```bash
gcloud auth activate-service-account sa-transformation@transformation-lc01.iam.gserviceaccount.com --key-file=sa-transformation.json --project=transformation-lc01
```
and list the files in the buckets:
```bash
gsutil ls gs://landing-lc01-eu-raw-data
gsutil ls gs://landing-lc01-eu-data-schema
```
## Dataflow
These steps should be run as the transformation service account:
```bash
gcloud auth activate-service-account sa-transformation@transformation-lc01.iam.gserviceaccount.com --key-file=sa-transformation.json --project=transformation-lc01
```
Let's then start a Dataflow batch pipeline from a Google-provided template, using internal IPs only, the network and subnetwork created earlier, the appropriate service account, and the required parameters:
```bash
gcloud dataflow jobs run test_batch_lcaggio01 \
--gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
--project transformation-lc01 \
--region europe-west3 \
--disable-public-ips \
--network transformation-vpc \
--subnetwork regions/europe-west3/subnetworks/transformation-subnet \
--staging-location gs://transformation-lc01-eu-temp \
--service-account-email sa-transformation@transformation-lc01.iam.gserviceaccount.com \
--parameters \
javascriptTextTransformFunctionName=transform,\
JSONPath=gs://landing-lc01-eu-data-schema/person_schema.json,\
javascriptTextTransformGcsPath=gs://landing-lc01-eu-data-schema/person_udf.js,\
inputFilePattern=gs://landing-lc01-eu-raw-data/person.csv,\
outputTable=dwh-lc01:bq_raw_dataset.person,\
bigQueryLoadingTemporaryDirectory=gs://transformation-lc01-eu-temp
```
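Once the batch job completes, you can verify the import with a quick query, run as the DWH service account (a minimal check, assuming the dataset and table created above):
```bash
bq --project_id=dwh-lc01 query --use_legacy_sql=false \
  'SELECT COUNT(*) AS row_count FROM `dwh-lc01.bq_raw_dataset.person`'
```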

View File

@ -0,0 +1,96 @@
# Manual pipeline Example: PubSub to Bigquery
In this example we will publish person messages in the following format:
```txt
name: Lorenzo
surname: Caggioni
timestamp: 1617898199
```
A Dataflow streaming pipeline will read those messages and import them into a Bigquery table in the DWH project.
An authorized view will be created in the datamart project to expose the table.
[TODO] Remove the hardcoded 'lcaggio' values and replace them with an environment variable.
[TODO] Further automation is expected in the future.
Create and download keys for the service accounts you created; be sure to have the `iam.serviceAccountKeys.create` permission on the projects or at the folder level.
```bash
gcloud iam service-accounts keys create sa-landing.json --iam-account=sa-landing@landing-lc01.iam.gserviceaccount.com
gcloud iam service-accounts keys create sa-transformation.json --iam-account=sa-transformation@transformation-lc01.iam.gserviceaccount.com
gcloud iam service-accounts keys create sa-dwh.json --iam-account=sa-dwh@dwh-lc01.iam.gserviceaccount.com
```
## Create BQ table
These steps should be run as the DWH service account:
```bash
gcloud auth activate-service-account sa-dwh@dwh-lc01.iam.gserviceaccount.com --key-file=sa-dwh.json --project=dwh-lc01
```
Then run the following command to create the table:
```bash
bq mk \
-t \
--description "This is a Test Person table" \
dwh-lc01:bq_raw_dataset.person \
name:STRING,surname:STRING,timestamp:TIMESTAMP
```
## Produce PubSub messages
These steps should be run as the landing service account:
```bash
gcloud auth activate-service-account sa-landing@landing-lc01.iam.gserviceaccount.com --key-file=sa-landing.json --project=landing-lc01
```
Let's now publish a series of messages to import:
```bash
for i in {0..10}
do
gcloud pubsub topics publish projects/landing-lc01/topics/landing-1 --message="{\"name\": \"Lorenzo\", \"surname\": \"Caggioni\", \"timestamp\": \"$(date +%s)\"}"
done
```
If you want to check the published messages, you can use the transformation service account:
```bash
gcloud auth activate-service-account sa-transformation@transformation-lc01.iam.gserviceaccount.com --key-file=sa-transformation.json --project=transformation-lc01
```
and read a message (the message won't be acked and will stay in the subscription):
```bash
gcloud pubsub subscriptions pull projects/landing-lc01/subscriptions/sub1
```
## Dataflow
These steps should be run as the transformation service account:
```bash
gcloud auth activate-service-account sa-transformation@transformation-lc01.iam.gserviceaccount.com --key-file=sa-transformation.json --project=transformation-lc01
```
Let's then start a Dataflow streaming pipeline from a Google-provided template, using internal IPs only, the network and subnetwork created earlier, the appropriate service account, and the required parameters:
```bash
gcloud dataflow jobs run test_lcaggio01 \
--gcs-location gs://dataflow-templates/latest/PubSub_Subscription_to_BigQuery \
--project transformation-lc01 \
--region europe-west3 \
--disable-public-ips \
--network transformation-vpc \
--subnetwork regions/europe-west3/subnetworks/transformation-subnet \
--staging-location gs://transformation-lc01-eu-temp \
--service-account-email sa-transformation@transformation-lc01.iam.gserviceaccount.com \
--parameters \
inputSubscription=projects/landing-lc01/subscriptions/sub1,\
outputTableSpec=dwh-lc01:bq_raw_dataset.person
```
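Once the streaming job is consuming messages, you can check that rows land in the table, again as the DWH service account (a minimal check, assuming the dataset and table created above):
```bash
bq --project_id=dwh-lc01 query --use_legacy_sql=false \
  'SELECT name, surname, timestamp FROM `dwh-lc01.bq_raw_dataset.person` LIMIT 10'
```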

View File

@ -0,0 +1,26 @@
{
"schema": {
"fields": [
{
"mode": "NULLABLE",
"name": "name",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "surname",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "age",
"type": "INTEGER"
},
{
"mode": "NULLABLE",
"name": "boolean_val",
"type": "BOOLEAN"
}
]
}
}

View File

@ -0,0 +1,61 @@
# Data Foundation Platform
The goal of this example is to build a robust and flexible Data Foundation on GCP, providing opinionated defaults while still allowing customers to quickly and reliably build and scale out additional data pipelines.
The example is composed of three separate provisioning workflows, designed to be plugged together to create an end-to-end data foundation that supports multiple data pipelines on top.
1. **[Environment Setup](./01-environment/)**
*(once per environment)*
* projects
* VPC configuration
* Composer environment and identity
* shared buckets and datasets
1. **[Data Source Setup](./02-resources)**
*(once per data source)*
* landing and archive bucket
* internal and external identities
* domain specific datasets
1. **[Pipeline Setup](./03-pipeline)**
*(once per pipeline)*
* pipeline-specific tables and views
* pipeline code
* Composer DAG
The resulting GCP architecture is outlined in this diagram:
![Target architecture](./02-resources/diagram.png)
A demo pipeline is also part of this example: it can be built and run on top of the foundational infrastructure to quickly verify or test the setup.
## Prerequisites
In order to bring up this example, you will need:
- a folder or organization where new projects will be created
- a billing account that will be associated to new projects
- an identity (user or service account) with owner permissions on the folder or org, and billing user permissions on the billing account (see the example below)
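For example, these roles could be granted with commands along these lines (a sketch; the folder id, billing account id, and identity are placeholders taken from this example's docs):
```bash
# Owner on the parent folder (or grant the equivalent at the org level).
gcloud resource-manager folders add-iam-policy-binding 12345678 \
  --member="user:you@example.com" --role="roles/owner"

# Billing user on the billing account.
gcloud beta billing accounts add-iam-policy-binding 1234-1234-1234 \
  --member="user:you@example.com" --role="roles/billing.user"
```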
## Bringing up the platform
The end-to-end example is composed of two foundational steps and one optional step:
1. [Environment setup](./01-environment/)
1. [Data source setup](./02-resources/)
1. (Optional) [Pipeline setup](./03-pipeline/)
The environment setup is designed to manage a single environment. Various strategies like workspaces, branching, or even separate clones can be used to support multiple environments.
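Since the resources stage consumes the environment stage's outputs, a single environment can be brought up from a shell roughly as follows (a sketch, assuming this example's directory layout):
```bash
# Stage 1: projects and the main service account.
cd 01-environment
terraform init && terraform apply

# Copy the project_ids output into the next stage's terraform.tfvars.
terraform output project_ids

# Stage 2: buckets, datasets, network and Pub/Sub resources.
cd ../02-resources
terraform init && terraform apply
```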
## TODO
| Description | Priority (1: high - 5: low) | Status | Remarks |
|-------------|----------|:------:|---------|
| DLP best practices in the pipeline | 2 | Not Started | |
| KMS support (CMEK) | 2 | Not Started | |
| VPC-SC | 3 | Not Started | |
| Add Composer with a static DAG running the example | 3 | Not Started | |
| Integrate [CI/CD composer data processing workflow framework](https://github.com/jaketf/ci-cd-for-data-processing-workflow) | 3 | Not Started | |
| Schema changes, how to handle | 4 | Not Started | |
| Data lineage | 4 | Not Started | |
| Data quality checks | 4 | Not Started | |
| Shared-VPC | 5 | Not Started | |
| Logging & monitoring | TBD | Not Started | |
| Orchestration for the ingestion pipeline (just in the README) | TBD | Not Started | |

View File

@ -0,0 +1,13 @@
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View File

@ -0,0 +1,26 @@
/**
* Copyright 2021 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
module "test-environment" {
source = "../../../../data-solutions/data-platform-foundations/01-environment"
billing_account_id = var.billing_account
root_node = var.root_node
}
module "test-resources" {
source = "../../../../data-solutions/data-platform-foundations/02-resources"
project_ids = module.test-environment.project_ids
}

View File

@ -0,0 +1,26 @@
/**
* Copyright 2021 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
variable "billing_account" {
type = string
default = "123456-123456-123456"
}
variable "root_node" {
description = "The resource name of the parent Folder or Organization. Must be of the form folders/folder_id or organizations/org_id."
type = string
default = "folders/12345678"
}

View File

@ -0,0 +1,27 @@
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import pytest
FIXTURES_DIR = os.path.join(os.path.dirname(__file__), 'fixture')
def test_resources(e2e_plan_runner):
"Test that plan works and the numbers of resources is as expected."
modules, resources = e2e_plan_runner(FIXTURES_DIR)
assert len(modules) == 6
assert len(resources) == 32