Changes to gcs to bq least privilege example (#447)

* Changes to gcs to bq least privilege example

* Fix 'try' on encryption variables

* Fix roles

* Fix tests

* Use templatefile in output variables

* Remove FIXME

* Fix tests

* Merge branch 'jccb/gcs-to-bq-changes' of https://github.com/GoogleCloudPlatform/cloud-foundation-fabric into jccb/gcs-to-bq-changes

* fix readme and template

* fix readme

* Update FIXME.

Co-authored-by: Lorenzo Caggioni <lorenzo.caggioni@gmail.com>
Co-authored-by: Ludovico Magnocavallo <ludomagno@google.com>
Julio Castillo 2022-02-02 08:32:59 +01:00 committed by GitHub
parent 98b238ae7a
commit 5396735bc6
11 changed files with 81 additions and 76 deletions

View File

@ -3,17 +3,18 @@
This example creates the infrastructure needed to run a [Cloud Dataflow](https://cloud.google.com/dataflow) pipeline to import data from [GCS](https://cloud.google.com/storage) to [BigQuery](https://cloud.google.com/bigquery). The example creates different service accounts with least privilege on each resource. To run the pipeline, users listed in `data_eng_principals` can impersonate all of those service accounts.
The solution will use:
- internal IPs for GCE and Cloud Dataflow instances
- Cloud NAT to let resources egress to the Internet, to run system updates and install packages
- [Service Account Impersonation](https://cloud.google.com/iam/docs/impersonating-service-accounts) to avoid the use of service account keys (see the quick check below)
- Service Accounts with least privilege on each resource
- (Optional) CMEK encryption for GCS buckets, Dataflow instances and BigQuery tables
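Since all access in this example relies on Service Account Impersonation rather than exported keys, a quick way to confirm that your user can impersonate one of the example's service accounts is sketched below; the service account and project names are the same placeholders used in the commands later in this README, adjust them to your deployment:

```bash
# If this prints an access token, impersonation is correctly set up for your user.
gcloud auth print-access-token \
  --impersonate-service-account=orch-test@PROJECT.iam.gserviceaccount.com
```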
The example is designed to match real-world use cases with a minimum amount of resources and some compromises listed below. It can be used as a starting point for more complex scenarios.
This is the high-level diagram:
![GCS to BigQuery high-level diagram](diagram.png "GCS to BigQuery high-level diagram")
## Considerations for a real-world use case
In this example we made some compromises to keep it minimal and easy to read. For a real-world use case, you may want to evaluate the option to:
- Configure a Shared-VPC
@ -93,13 +94,13 @@ We need to create 3 files:
- A `person_udf.js` file containing the JavaScript UDF used by the Dataflow template.
- A `person_schema.json` file containing the table schema used to import the CSV.
You can find examples of those files in the `./data-demo` folder. You can copy the example files into the GCS bucket using the command returned in the Terraform output as `command_01_gcs`. Below is an example:
```bash
gsutil -i gcs-landing@PROJECT.iam.gserviceaccount.com cp data-demo/* gs://LANDING_BUCKET
```
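To confirm the files actually landed in the bucket, you can list it with the same impersonated identity (same placeholders as in the command above):

```bash
# List the objects just copied into the landing bucket.
gsutil -i gcs-landing@PROJECT.iam.gserviceaccount.com ls gs://LANDING_BUCKET
```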
We can now run the Dataflow pipeline using the `gcloud` command returned in the Terraform output as `command_02_dataflow`. Below is an example:
```bash
gcloud --impersonate-service-account=orch-test@PROJECT.iam.gserviceaccount.com dataflow jobs run test_batch_01 \
@ -119,7 +120,7 @@ outputTable=PROJECT:datalake.person,\
bigQueryLoadingTemporaryDirectory=gs://PREFIX-df-tmp
```
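Once the job has been launched, its progress can be followed with `gcloud` as well; for example (region and project are placeholders):

```bash
# List recent Dataflow jobs and their state in the region used by the example.
gcloud dataflow jobs list --region=REGION --project=PROJECT
```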
You can check the data imported into Google BigQuery using the command returned in the Terraform output as `command_03_bq`. Below is an example:
```bash
bq query --use_legacy_sql=false 'SELECT * FROM `PROJECT.datalake.person` LIMIT 1000'
@ -144,10 +145,10 @@ bq query --use_legacy_sql=false 'SELECT * FROM `PROJECT.datalake.person` LIMIT 1
|---|---|:---:|
| [bq_tables](outputs.tf#L15) | Bigquery Tables. | |
| [buckets](outputs.tf#L20) | GCS bucket Cloud KMS crypto keys. | |
| [command_01_gcs](outputs.tf#L43) | gcloud command to copy data into the created bucket impersonating the service account. | |
| [command_02_dataflow](outputs.tf#L48) | Command to run Dataflow template impersonating the service account. | |
| [command_03_bq](outputs.tf#L69) | BigQuery command to query imported data. | |
| [project_id](outputs.tf#L28) | Project id. | |
| [service_accounts](outputs.tf#L33) | Service account. | |
<!-- END TFDOC -->

View File

@ -0,0 +1,2 @@
bq query --project_id=${project_id} --use_legacy_sql=false \
'SELECT * FROM `${project_id}.${bigquery_dataset}.${bigquery_table}` LIMIT ${sql_limit}'

View File

@ -0,0 +1,18 @@
gcloud \
--impersonate-service-account=${sa_orch_email} \
dataflow jobs run test_batch_01 \
--gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
--project ${project_id} \
--region ${region} \
--disable-public-ips \
--subnetwork ${subnet} \
--staging-location ${gcs_df_stg} \
--service-account-email ${sa_df_email} \
%{ if cmek_encryption }--dataflow-kms-key=${kms_key_df} %{ endif } \
--parameters \
javascriptTextTransformFunctionName=transform,\
JSONPath=${data_schema_file},\
javascriptTextTransformGcsPath=${data_udf_file},\
inputFilePattern=${data_file},\
outputTable=${project_id}:${bigquery_dataset}.${bigquery_table},\
bigQueryLoadingTemporaryDirectory=${gcs_df_tmp}
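Both `.tftpl` files above are rendered through `templatefile()` in `outputs.tf` (see the outputs diff further down), so after `terraform apply` the fully expanded commands can be read straight from the outputs; a minimal sketch, assuming Terraform 0.14 or later for the `-raw` flag:

```bash
# Run from the example directory after `terraform apply`.
terraform output -raw command_02_dataflow
terraform output -raw command_03_bq
```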

View File

@ -12,10 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
module "gcs-data" {
source = "../../../modules/gcs"
project_id = module.project.project_id
@ -23,7 +19,7 @@ module "gcs-data" {
name = "data"
location = var.region
storage_class = "REGIONAL"
encryption_key = var.cmek_encryption ? module.kms[0].keys.key-gcs.id : null
force_destroy = true
}
@ -34,22 +30,20 @@ module "gcs-df-tmp" {
name = "df-tmp"
location = var.region
storage_class = "REGIONAL"
encryption_key = var.cmek_encryption ? module.kms[0].keys.key-gcs.id : null
force_destroy = true
}
module "bigquery-dataset" {
source = "../../../modules/bigquery-dataset"
project_id = module.project.project_id
id = "datalake"
location = var.region
# Note: we define tables in Terraform for the purpose of this
# example. A production environment would probably handle table
# creation in a separate terraform pipeline or using a different
# tool (for example: Dataform)
tables = {
person = {
friendly_name = "Person. Dataflow import."
@ -64,10 +58,10 @@ module "bigquery-dataset" {
deletion_protection = false
options = {
clustering = null
encryption_key = var.cmek_encryption ? module.kms[0].keys.key-bq.id : null
expiration_time = null
}
}
}
encryption_key = var.cmek_encryption ? module.kms[0].keys.key-bq.id : null
}
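When `cmek_encryption` is enabled, the default keys picked up by the resources above can be double-checked from the command line; a hedged sketch, where the bucket name, project id, dataset and table are placeholders matching the README examples:

```bash
# Default CMEK key configured on the data bucket.
gsutil kms encryption gs://PREFIX-data
# Encryption configuration (among other details) of the imported table.
bq show --format=prettyjson PROJECT:datalake.person
```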

View File

@ -17,7 +17,7 @@ module "kms" {
source = "../../../modules/kms"
project_id = module.project.project_id
keyring = {
name = "${var.prefix}-keyring"
location = var.region
}
keys = {

View File

@ -58,19 +58,18 @@ locals {
module.service-account-orch.iam_email,
]
"roles/iam.serviceAccountTokenCreator" = concat(
var.data_eng_principals,
)
"roles/viewer" = concat(
var.data_eng_principals
)
# Dataflow roles
"roles/dataflow.admin" = concat(
[module.service-account-orch.iam_email],
var.data_eng_principals
)
"roles/dataflow.worker" = [
module.service-account-df.iam_email,
]
"roles/dataflow.developer" = var.data_eng_principals
"roles/compute.viewer" = var.data_eng_principals
# network roles
"roles/compute.networkUser" = [
module.service-account-df.iam_email,
@ -79,10 +78,6 @@ locals {
}
}
module "project" {
source = "../../../modules/project"
name = var.project_id
@ -101,6 +96,7 @@ module "project" {
"storage.googleapis.com",
"storage-component.googleapis.com",
]
# additive IAM bindings avoid disrupting bindings in existing project
iam = var.project_create != null ? local.iam : {}
iam_additive = var.project_create == null ? local.iam : {}
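Since bindings are applied authoritatively on a newly created project and additively on an existing one, it can be useful to verify which roles a given member actually ended up with; a sketch using standard `gcloud` policy filtering, with `PROJECT` and `SA_EMAIL` as placeholders:

```bash
# Show all roles bound to a specific service account in the project IAM policy.
gcloud projects get-iam-policy PROJECT \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:SA_EMAIL" \
  --format="table(bindings.role)"
```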

View File

@ -30,7 +30,7 @@ output "project_id" {
value = module.project.project_id
}
output "service_accounts" {
description = "Service account."
value = {
bq = module.service-account-bq.email
@ -40,36 +40,38 @@ output "serviceaccount" {
}
}
output "command_01_gcs" {
description = "gcloud command to copy data into the created bucket impersonating the service account."
value = "gsutil -i ${module.service-account-landing.email} cp data-demo/* ${module.gcs-data.url}"
}
output "command_02_dataflow" {
description = "Command to run Dataflow template impersonating the service account."
value = templatefile("${path.module}/dataflow.tftpl", {
sa_orch_email = module.service-account-orch.email
project_id = module.project.project_id
region = var.region
subnet = module.vpc.subnets["${var.region}/subnet"].self_link
gcs_df_stg = format("%s/%s", module.gcs-df-tmp.url, "stg")
sa_df_email = module.service-account-df.email
cmek_encryption = var.cmek_encryption
kms_key_df = var.cmek_encryption ? module.kms[0].key_ids.key-df : null
gcs_data = module.gcs-data.url
data_schema_file = format("%s/%s", module.gcs-data.url, "person_schema.json")
data_udf_file = format("%s/%s", module.gcs-data.url, "person_udf.js")
data_file = format("%s/%s", module.gcs-data.url, "person.csv")
bigquery_dataset = module.bigquery-dataset.dataset_id
bigquery_table = module.bigquery-dataset.tables["person"].table_id
gcs_df_tmp = format("%s/%s", module.gcs-df-tmp.url, "tmp")
})
}
output "command_03_bq" {
description = "BigQuery command to query imported data."
value = templatefile("${path.module}/bigquery.tftpl", {
project_id = module.project.project_id
bigquery_dataset = module.bigquery-dataset.dataset_id
bigquery_table = module.bigquery-dataset.tables["person"].table_id
sql_limit = 1000
})
}

View File

@ -12,10 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
module "service-account-bq" {
source = "../../../modules/iam-service-account"
project_id = module.project.project_id

View File

@ -12,10 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
module "vpc" {
source = "../../../modules/net-vpc"
project_id = module.project.project_id

View File

@ -24,4 +24,4 @@ def test_resources(e2e_plan_runner):
"Test that plan works and the numbers of resources is as expected."
modules, resources = e2e_plan_runner(FIXTURES_DIR)
assert len(modules) == 11
assert len(resources) == 44
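The updated counts can be verified locally by running the example's pytest-based tests; a minimal sketch, run from the repository root with test selection by keyword since the exact test path may differ:

```bash
# Adjust the -k expression to match the example's test module name.
pytest -vv -k gcs_to_bq
```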