Bugfixing Data Foundations (#310)

* Bugfixing Data Foundations and impersonation support
- Fixed SA permissions
- Use of impersonation to avoid SA private key export
- Fixed required API enablement
- Added firewall rules required by Dataflow
- Added provider configuration for SA impersonation
javiergp 2021-09-28 17:13:18 +02:00 committed by GitHub
parent 8b69638f89
commit 15b2736a7c
10 changed files with 171 additions and 112 deletions

View File

@ -25,10 +25,12 @@ To create the infrastructure:
```tfm
billing_account = "1234-1234-1234"
parent = "folders/12345678"
admins = ["user:xxxxx@yyyyy.com"]
```
- make sure you have the right authentication setup (application default credentials, or a service account key)
- make sure you have the right authentication setup (application default credentials, or a service account key) with the right permissions
- **The output of this stage contains the values for the resources stage**
- the `admins` variable contains a list of principals allowed to impersonate the service accounts. These principals will be given the `iam.serviceAccountTokenCreator` role
- run `terraform init` and `terraform apply`
Once done testing, you can clean up resources by running `terraform destroy`.
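With application default credentials, a minimal sketch of the flow (assuming the `gcloud` CLI is installed and your user is one of the `admins`):
```bash
# Create application default credentials that the Terraform
# Google provider picks up automatically.
gcloud auth login
gcloud auth application-default login

# Run the environment stage from this folder.
terraform init
terraform apply
```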
@ -57,6 +59,9 @@ The script use 'google_access_context_manager_service_perimeter_resource' terraf
| *service_account_names* | Override this variable if you need non-standard names. | <code title="object&#40;&#123;&#10;main &#61; string&#10;&#125;&#41;">object({...})</code> | | <code title="&#123;&#10;main &#61; &#34;data-platform-main&#34;&#10;&#125;">...</code> |
| *service_encryption_key_ids* | Cloud KMS encryption key in {LOCATION => [KEY_URL]} format. Keys belong to existing project. | <code title="object&#40;&#123;&#10;multiregional &#61; string&#10;global &#61; string&#10;&#125;&#41;">object({...})</code> | | <code title="&#123;&#10;multiregional &#61; null&#10;global &#61; null&#10;&#125;">...</code> |
| *service_perimeter_standard* | VPC Service control standard perimeter name in the form of 'accessPolicies/ACCESS_POLICY_NAME/servicePerimeters/PERIMETER_NAME'. All projects will be added to the perimeter in enforced mode. | <code title="">string</code> | | <code title="">null</code> |
| *admins* | List of users allowed to impersonate the service account | <code title="">list</code> | | <code title="">null</code> |
## Outputs

View File

@ -31,8 +31,9 @@ module "project-datamart" {
"storage.googleapis.com",
"storage-component.googleapis.com",
]
iam = {
"roles/editor" = [module.sa-services-main.iam_email]
iam_additive = {
"roles/owner" = [module.sa-services-main.iam_email]
}
service_encryption_key_ids = {
bq = [var.service_encryption_key_ids.multiregional]
@ -56,8 +57,8 @@ module "project-dwh" {
"storage.googleapis.com",
"storage-component.googleapis.com",
]
iam = {
"roles/editor" = [module.sa-services-main.iam_email]
iam_additive = {
"roles/owner" = [module.sa-services-main.iam_email]
}
service_encryption_key_ids = {
bq = [var.service_encryption_key_ids.multiregional]
@ -79,8 +80,8 @@ module "project-landing" {
"storage.googleapis.com",
"storage-component.googleapis.com",
]
iam = {
"roles/editor" = [module.sa-services-main.iam_email]
iam_additive = {
"roles/owner" = [module.sa-services-main.iam_email]
}
service_encryption_key_ids = {
pubsub = [var.service_encryption_key_ids.global]
@ -98,6 +99,10 @@ module "project-services" {
prefix = var.prefix
name = var.project_names.services
services = [
"bigquery.googleapis.com",
"cloudresourcemanager.googleapis.com",
"iam.googleapis.com",
"pubsub.googleapis.com",
"storage.googleapis.com",
"storage-component.googleapis.com",
"sourcerepo.googleapis.com",
@ -105,8 +110,8 @@ module "project-services" {
"cloudasset.googleapis.com",
"cloudkms.googleapis.com"
]
iam = {
"roles/editor" = [module.sa-services-main.iam_email]
iam_additive = {
"roles/owner" = [module.sa-services-main.iam_email]
}
service_encryption_key_ids = {
storage = [var.service_encryption_key_ids.multiregional]
@ -123,6 +128,7 @@ module "project-transformation" {
prefix = var.prefix
name = var.project_names.transformation
services = [
"bigquery.googleapis.com",
"cloudbuild.googleapis.com",
"compute.googleapis.com",
"dataflow.googleapis.com",
@ -130,8 +136,8 @@ module "project-transformation" {
"storage.googleapis.com",
"storage-component.googleapis.com",
]
iam = {
"roles/editor" = [module.sa-services-main.iam_email]
iam_additive = {
"roles/owner" = [module.sa-services-main.iam_email]
}
service_encryption_key_ids = {
compute = [var.service_encryption_key_ids.global]
@ -151,4 +157,6 @@ module "sa-services-main" {
source = "../../../modules/iam-service-account"
project_id = module.project-services.project_id
name = var.service_account_names.main
iam = var.admins != null ? { "roles/iam.serviceAccountTokenCreator" = var.admins } : {}
}
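Once applied, you can inspect who is allowed to impersonate the main service account; a quick check, assuming the default `data-platform-main` name and a hypothetical `services-xyz` project id:
```bash
# Shows the bindings on the service account itself, including
# roles/iam.serviceAccountTokenCreator for the admins list.
gcloud iam service-accounts get-iam-policy \
  data-platform-main@services-xyz.iam.gserviceaccount.com
```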

View File

@ -12,6 +12,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.
variable "admins" {
description = "List of users allowed to impersonate the service account"
type = list(string)
default = null
}
variable "billing_account_id" {
description = "Billing account id."
type = string

View File

@ -26,7 +26,7 @@ In the previous step, we created the environment (projects and service account)
To create the resources, copy the output of the environment step (**project_ids**) and paste it into `terraform.tfvars`:
- Specify your variables in a `terraform.tvars`, you can use the ouptu from the environment stage
- Specify your variables in a `terraform.tfvars`; you can use the output from the environment stage
```tfm
project_ids = {
@ -38,15 +38,14 @@ project_ids = {
}
```
- Get a key for the service account created in the environment stage:
- Go into services project
- Go into IAM page
- Go into the service account section
- Create a new key for the service account created in the previous step (**service_account**)
- Download the JSON key into the current folder
- make sure you have the right authentication setup: `export GOOGLE_APPLICATION_CREDENTIALS=PATH_TO_SERVICE_ACCOUNT_KEY.json`
- run `terraform init` and `terraform apply`
- The providers.tf file has been configured to impersonate the **main** service account
- To launch terraform:
```bash
terraform plan
terraform apply
```
Once done testing, you can clean up resources by running `terraform destroy`.
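If your user was not listed in `admins` during the environment stage, the impersonation grant can also be added manually; a hedged sketch with placeholder project id and user:
```bash
# Hypothetical values: replace the project id and user with your own.
gcloud iam service-accounts add-iam-policy-binding \
  data-platform-main@services-xyz.iam.gserviceaccount.com \
  --member="user:you@example.com" \
  --role="roles/iam.serviceAccountTokenCreator"
```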
### CMEK configuration
@ -68,6 +67,8 @@ You can configure GCP resources to use existing CMEK keys configuring the 'servi
| *transformation_buckets* | List of transformation buckets to create | <code title="map&#40;object&#40;&#123;&#10;location &#61; string&#10;name &#61; string&#10;&#125;&#41;&#41;">map(object({...}))</code> | | <code title="&#123;&#10;temp &#61; &#123;&#10;location &#61; &#34;EU&#34;&#10;name &#61; &#34;temp&#34;&#10;&#125;,&#10;templates &#61; &#123;&#10;location &#61; &#34;EU&#34;&#10;name &#61; &#34;templates&#34;&#10;&#125;,&#10;&#125;">...</code> |
| *transformation_subnets* | List of subnets to create in the transformation Project. | <code title="list&#40;object&#40;&#123;&#10;ip_cidr_range &#61; string&#10;name &#61; string&#10;region &#61; string&#10;secondary_ip_range &#61; map&#40;string&#41;&#10;&#125;&#41;&#41;">list(object({...}))</code> | | <code title="&#91;&#10;&#123;&#10;ip_cidr_range &#61; &#34;10.1.0.0&#47;20&#34;&#10;name &#61; &#34;transformation-subnet&#34;&#10;region &#61; &#34;europe-west3&#34;&#10;secondary_ip_range &#61; &#123;&#125;&#10;&#125;,&#10;&#93;">...</code> |
| *transformation_vpc_name* | Name of the VPC created in the transformation Project. | <code title="">string</code> | | <code title="">transformation-vpc</code> |
| *admins* | List of users allowed to impersonate the service account | <code title="">list</code> | | <code title="">null</code> |
## Outputs

View File

@ -25,12 +25,18 @@ module "datamart-sa" {
iam_project_roles = {
"${var.project_ids.datamart}" = ["roles/editor"]
}
iam = var.admins != null ? { "roles/iam.serviceAccountTokenCreator" = var.admins } : {}
}
module "dwh-sa" {
source = "../../../modules/iam-service-account"
project_id = var.project_ids.dwh
name = var.service_account_names.dwh
iam_project_roles = {
"${var.project_ids.dwh}" = ["roles/bigquery.admin"]
}
iam = var.admins != null ? { "roles/iam.serviceAccountTokenCreator" = var.admins } : {}
}
module "landing-sa" {
@ -38,8 +44,11 @@ module "landing-sa" {
project_id = var.project_ids.landing
name = var.service_account_names.landing
iam_project_roles = {
"${var.project_ids.landing}" = ["roles/pubsub.publisher"]
"${var.project_ids.landing}" = [
"roles/pubsub.publisher",
"roles/storage.objectCreator"]
}
iam = var.admins != null ? { "roles/iam.serviceAccountTokenCreator" = var.admins } : {}
}
module "services-sa" {
@ -49,6 +58,7 @@ module "services-sa" {
iam_project_roles = {
"${var.project_ids.services}" = ["roles/editor"]
}
iam = var.admins != null ? { "roles/iam.serviceAccountTokenCreator" = var.admins } : {}
}
module "transformation-sa" {
@ -66,8 +76,17 @@ module "transformation-sa" {
"roles/dataflow.worker",
"roles/bigquery.metadataViewer",
"roles/storage.objectViewer",
],
"${var.project_ids.landing}" = [
"roles/storage.objectViewer",
],
"${var.project_ids.dwh}" = [
"roles/bigquery.dataOwner",
"roles/bigquery.jobUser",
"roles/bigquery.metadataViewer",
]
}
iam = var.admins != null ? { "roles/iam.serviceAccountTokenCreator" = var.admins } : {}
}
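To verify the cross-project grants above after apply, the project IAM policies can be filtered for the transformation service account; a sketch using hypothetical project ids and the `sa-transformation` name from the demo docs:
```bash
# List the roles bound to the transformation SA in the landing project.
gcloud projects get-iam-policy landing-xyz \
  --flatten="bindings[].members" \
  --filter="bindings.members:sa-transformation@transformation-xyz.iam.gserviceaccount.com" \
  --format="table(bindings.role)"
```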
###############################################################################
@ -147,6 +166,31 @@ module "vpc-transformation" {
subnets = var.transformation_subnets
}
module "firewall" {
source = "../../../modules/net-vpc-firewall"
project_id = var.project_ids.transformation
network = module.vpc-transformation.name
admin_ranges_enabled = false
admin_ranges = [""]
http_source_ranges = []
https_source_ranges = []
ssh_source_ranges = []
custom_rules = {
iap-svc = {
description = "Dataflow service."
direction = "INGRESS"
action = "allow"
sources = ["dataflow"]
targets = ["dataflow"]
ranges = []
use_service_accounts = false
rules = [{ protocol = "tcp", ports = ["12345-12346"] }]
extra_attributes = {}
}
}
}
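The custom rule above only opens the Dataflow shuffle ports between workers tagged `dataflow`; a roughly equivalent standalone `gcloud` command, shown for clarity with a placeholder project and the VPC name created by this stage:
```bash
# Allow Dataflow worker-to-worker traffic on the shuffle ports,
# scoped to instances carrying the "dataflow" network tag.
gcloud compute firewall-rules create allow-dataflow-workers \
  --project=transformation-xyz \
  --network=transformation-vpc \
  --direction=INGRESS \
  --action=ALLOW \
  --rules=tcp:12345-12346 \
  --source-tags=dataflow \
  --target-tags=dataflow
```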
###############################################################################
# Pub/Sub #
###############################################################################

View File

@ -0,0 +1,20 @@
# Copyright 2020 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
provider "google" {
impersonate_service_account = "data-platform-main@${var.project_ids.services}.iam.gserviceaccount.com"
}
provider "google-beta" {
impersonate_service_account = "data-platform-main@${var.project_ids.services}.iam.gserviceaccount.com"
}
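Before running this stage you can confirm that your credentials are able to impersonate the main service account; a minimal check, with the services project id as a placeholder:
```bash
# If this prints a token, your principal holds
# roles/iam.serviceAccountTokenCreator on the main service account.
export SERVICES_PROJECT_ID=services-xyz  # placeholder
gcloud auth print-access-token \
  --impersonate-service-account=data-platform-main@$SERVICES_PROJECT_ID.iam.gserviceaccount.com
```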

View File

@ -12,6 +12,13 @@
# See the License for the specific language governing permissions and
# limitations under the License.
variable "admins" {
description = "List of users allowed to impersonate the service account"
type = list(string)
default = null
}
variable "datamart_bq_datasets" {
description = "Datamart Bigquery datasets"
type = map(object({

View File

@ -3,43 +3,38 @@
In this example we will publish person messages in the following format:
```bash
Lorenzo,Caggioni,1617898199
name,surname,1617898199
```
a Dataflow pipeline will read those messages and import them into a Bigquery table in the DWH project.
A Dataflow pipeline will read those messages and import them into a Bigquery table in the DWH project.
[TODO] An authorized view will be created in the datamart project to expose the table.
[TODO] Remove hardcoded 'lcaggio' variables and made ENV variable for it.
[TODO] Further automation is expected in future.
Create and download keys for Service accounts you created.
## Create BQ table
Those steps should be done as Transformation Service Account:
## Set up the env vars
```bash
gcloud auth activate-service-account sa-dwh@dwh-lc01.iam.gserviceaccount.com --key-file=sa-dwh.json --project=dwh-lc01
export DWH_PROJECT_ID=**dwh_project_id**
export LANDING_PROJECT_ID=**landing_project_id**
export TRANSFORMATION_PROJECT_ID=**transformation_project_id**
```
and you can run the command to create a table:
## Create BQ table
Those steps should be done as DWH Service Account.
You can run the command to create a table:
```bash
bq mk \
-t \
gcloud --impersonate-service-account=sa-datawh@$DWH_PROJECT_ID.iam.gserviceaccount.com \
alpha bq tables create person \
--project=$DWH_PROJECT_ID --dataset=bq_raw_dataset \
--description "This is a Test Person table" \
dwh-lc01:bq_raw_dataset.person \
name:STRING,surname:STRING,timestamp:TIMESTAMP
--schema name=STRING,surname=STRING,timestamp=TIMESTAMP
```
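If the `gcloud alpha bq` surface is not available in your SDK, a classic `bq mk` call should create the same table; a sketch, to be run with credentials that can administer the DWH dataset:
```bash
# Equivalent table creation with the bq CLI (schema as name:TYPE pairs).
bq mk \
  --table \
  --project_id=$DWH_PROJECT_ID \
  --description "This is a Test Person table" \
  bq_raw_dataset.person \
  name:STRING,surname:STRING,timestamp:TIMESTAMP
```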
## Produce CSV data file, JSON schema file and UDF JS file
Those steps should be done as landing Service Account:
```bash
gcloud auth activate-service-account sa-landing@landing-lc01.iam.gserviceaccount.com --key-file=sa-landing.json --project=landing-lc01
```
Let's now create a series of messages we can use to import:
```bash
@ -52,7 +47,7 @@ done
and copy files to the GCS bucket:
```bash
gsutil cp person.csv gs://landing-lc01-eu-raw-data
gsutil -i sa-landing@$LANDING_PROJECT_ID.iam.gserviceaccount.com cp person.csv gs://$LANDING_PROJECT_ID-eu-raw-data
```
Let's create the data JSON schema:
@ -81,7 +76,8 @@ EOF
and copy files to the GCS bucket:
```bash
gsutil cp person_schema.json gs://landing-lc01-eu-data-schema
gsutil -i sa-landing@$LANDING_PROJECT_ID.iam.gserviceaccount.com cp person_schema.json gs://$LANDING_PROJECT_ID-eu-data-schema
```
Let's create the data UDF function to transform message data:
@ -105,47 +101,40 @@ EOF
and copy files to the GCS bucket:
```bash
gsutil cp person_udf.js gs://landing-lc01-eu-data-schema
gsutil -i sa-landing@$LANDING_PROJECT_ID.iam.gserviceaccount.com cp person_udf.js gs://$LANDING_PROJECT_ID-eu-data-schema
```
if you want to check files copied to GCS, you can use the Transformation service account:
```bash
gcloud auth activate-service-account sa-transformation@transformation-lc01.iam.gserviceaccount.com --key-file=sa-transformation.json --project=transformation-lc01
gsutil ls gs://landing-lc01-eu-raw-data
gsutil ls gs://landing-lc01-eu-data-schema
```
```bash
gsutil -i sa-transformation@$TRANSFORMATION_PROJECT_ID.iam.gserviceaccount.com ls gs://$LANDING_PROJECT_ID-eu-raw-data
gsutil -i sa-transformation@$TRANSFORMATION_PROJECT_ID.iam.gserviceaccount.com ls gs://$LANDING_PROJECT_ID-eu-data-schema
```
## Dataflow
Those steps should be done as transformation Service Account:
Those steps should be done as transformation Service Account.
Let's then start a Dataflow batch pipeline from a Google-provided template, using internal-only IPs, the created network and subnetwork, the appropriate service account, and the requested parameters:
```bash
gcloud auth activate-service-account sa-transformation@transformation-lc01.iam.gserviceaccount.com --key-file=sa-transformation.json --project=transformation-lc01
```
Let's than start a Dataflwo batch pipeline using a Google provided template using internal only IPs, the created network and subnetwork, the appropriate service account and requested parameters:
```bash
gcloud dataflow jobs run test_batch_lcaggio01 \
gcloud --impersonate-service-account=sa-transformation@$TRANSFORMATION_PROJECT_ID.iam.gserviceaccount.com dataflow jobs run test_batch_01 \
--gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
--project transformation-lc01 \
--project $TRANSFORMATION_PROJECT_ID \
--region europe-west3 \
--disable-public-ips \
--network transformation-vpc \
--subnetwork regions/europe-west3/subnetworks/transformation-subnet \
--staging-location gs://transformation-lc01-eu-temp \
--service-account-email sa-transformation@transformation-lc01.iam.gserviceaccount.com \
--staging-location gs://$TRANSFORMATION_PROJECT_ID-eu-temp \
--service-account-email sa-transformation@$TRANSFORMATION_PROJECT_ID.iam.gserviceaccount.com \
--parameters \
javascriptTextTransformFunctionName=transform,\
JSONPath=gs://landing-lc01-eu-data-schema/person_schema.json,\
javascriptTextTransformGcsPath=gs://landing-lc01-eu-data-schema/person_udf.js,\
inputFilePattern=gs://landing-lc01-eu-raw-data/person.csv,\
outputTable=dwh-lc01:bq_raw_dataset.person,\
bigQueryLoadingTemporaryDirectory=gs://transformation-lc01-eu-temp
JSONPath=gs://$LANDING_PROJECT_ID-eu-data-schema/person_schema.json,\
javascriptTextTransformGcsPath=gs://$LANDING_PROJECT_ID-eu-data-schema/person_udf.js,\
inputFilePattern=gs://$LANDING_PROJECT_ID-eu-raw-data/person.csv,\
outputTable=$DWH_PROJECT_ID:bq_raw_dataset.person,\
bigQueryLoadingTemporaryDirectory=gs://$TRANSFORMATION_PROJECT_ID-eu-temp
```
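Once submitted, the batch job can be monitored with the same impersonated identity; a short check using the region and project from the command above:
```bash
# List recent Dataflow jobs and their state in the transformation project.
gcloud --impersonate-service-account=sa-transformation@$TRANSFORMATION_PROJECT_ID.iam.gserviceaccount.com \
  dataflow jobs list \
  --project=$TRANSFORMATION_PROJECT_ID \
  --region=europe-west3
```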

View File

@ -3,8 +3,8 @@
In this example we will publish person messages in the following format:
```txt
name: Lorenzo
surname: Caggioni
name: Name
surname: Surname
timestamp: 1617898199
```
@ -12,85 +12,64 @@ a Dataflow pipeline will read those messages and import them into a Bigquery tab
An authorized view will be created in the datamart project to expose the table.
[TODO] Remove hardcoded 'lcaggio' variables and made ENV variable for it.
[TODO] Further automation is expected in future.
Create and download keys for Service accounts you created, be sure to have `iam.serviceAccountKeys.create` permission on projects or at folder level.
## Set up the env vars
```bash
gcloud iam service-accounts keys create sa-landing.json --iam-account=sa-landing@landing-lc01.iam.gserviceaccount.com
gcloud iam service-accounts keys create sa-transformation.json --iam-account=sa-transformation@transformation-lc01.iam.gserviceaccount.com
gcloud iam service-accounts keys create sa-dwh.json --iam-account=sa-dwh@dwh-lc01.iam.gserviceaccount.com
export DWH_PROJECT_ID=**dwh_project_id**
export LANDING_PROJECT_ID=**landing_project_id**
export TRANSFORMATION_PROJECT_ID=**transformation_project_id**
```
## Create BQ table
Those steps should be done as DWH Service Account.
Those steps should be done as Transformation Service Account:
You can run the command to create a table:
```bash
gcloud auth activate-service-account sa-dwh@dwh-lc01.iam.gserviceaccount.com --key-file=sa-dwh.json --project=dwh-lc01
```
and you can run the command to create a table:
```bash
bq mk \
-t \
gcloud --impersonate-service-account=sa-datawh@$DWH_PROJECT_ID.iam.gserviceaccount.com \
alpha bq tables create person \
--project=$DWH_PROJECT_ID --dataset=bq_raw_dataset \
--description "This is a Test Person table" \
dwh-lc01:bq_raw_dataset.person \
name:STRING,surname:STRING,timestamp:TIMESTAMP
--schema name=STRING,surname=STRING,timestamp=TIMESTAMP
```
## Produce PubSub messages
Those steps should be done as landing Service Account:
```bash
gcloud auth activate-service-account sa-landing@landing-lc01.iam.gserviceaccount.com --key-file=sa-landing.json --project=landing-lc01
```
and let's now create a series of messages we can use to import:
Let's now create a series of messages we can use to import:
```bash
for i in {0..10}
do
gcloud pubsub topics publish projects/landing-lc01/topics/landing-1 --message="{\"name\": \"Lorenzo\", \"surname\": \"Caggioni\", \"timestamp\": \"$(date +%s)\"}"
gcloud --impersonate-service-account=sa-landing@$LANDING_PROJECT_ID.iam.gserviceaccount.com pubsub topics publish projects/$LANDING_PROJECT_ID/topics/landing-1 --message="{\"name\": \"Lorenzo\", \"surname\": \"Caggioni\", \"timestamp\": \"$(date +%s)\"}"
done
```
if you want to check messages published, you can use the Transformation service account:
if you want to check messages published, you can use the Transformation service account and read a message (message won't be acked and will stay in the subscription):
```bash
gcloud auth activate-service-account sa-transformation@transformation-lc01.iam.gserviceaccount.com --key-file=sa-transformation.json --project=transformation-lc01
```
and read a message (message won't be acked and will stay in the subscription):
```bash
gcloud pubsub subscriptions pull projects/landing-lc01/subscriptions/sub1
gcloud --impersonate-service-account=sa-transformation@$TRANSFORMATION_PROJECT_ID.iam.gserviceaccount.com pubsub subscriptions pull projects/$LANDING_PROJECT_ID/subscriptions/sub1
```
## Dataflow
Those steps should be done as transformation Service Account:
```bash
gcloud auth activate-service-account sa-transformation@transformation-lc01.iam.gserviceaccount.com --key-file=sa-transformation.json --project=transformation-lc01
```
Let's than start a Dataflwo streaming pipeline using a Google provided template using internal only IPs, the created network and subnetwork, the appropriate service account and requested parameters:
Let's then start a Dataflow streaming pipeline from a Google-provided template, using internal-only IPs, the created network and subnetwork, the appropriate service account, and the requested parameters:
```bash
gcloud dataflow jobs run test_lcaggio01 \
gcloud dataflow jobs run test_streaming01 \
--gcs-location gs://dataflow-templates/latest/PubSub_Subscription_to_BigQuery \
--project transformation-lc01 \
--project $TRANSFORMATION_PROJECT_ID \
--region europe-west3 \
--disable-public-ips \
--network transformation-vpc \
--subnetwork regions/europe-west3/subnetworks/transformation-subnet \
--staging-location gs://transformation-lc01-eu-temp \
--service-account-email sa-transformation@transformation-lc01.iam.gserviceaccount.com \
--staging-location gs://$TRANSFORMATION_PROJECT_ID-eu-temp \
--service-account-email sa-transformation@$TRANSFORMATION_PROJECT_ID.iam.gserviceaccount.com \
--parameters \
inputSubscription=projects/landing-lc01/subscriptions/sub1,\
outputTableSpec=dwh-lc01:bq_raw_dataset.person
inputSubscription=projects/$LANDING_PROJECT_ID/subscriptions/sub1,\
outputTableSpec=$DWH_PROJECT_ID:bq_raw_dataset.person
```
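To confirm that streaming rows are landing in BigQuery, a quick count query can be run; a sketch, using whatever credentials can read the DWH dataset:
```bash
# Count rows written by the streaming pipeline.
bq --project_id=$DWH_PROJECT_ID query --use_legacy_sql=false \
  'SELECT COUNT(*) AS row_count FROM bq_raw_dataset.person'
```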

View File

@ -24,4 +24,4 @@ def test_resources(e2e_plan_runner):
"Test that plan works and the numbers of resources is as expected."
modules, resources = e2e_plan_runner(FIXTURES_DIR)
assert len(modules) == 6
assert len(resources) == 45
assert len(resources) == 53