Changes to gcs to bq least privilege example (#447)

* Changes to gcs to bq least privilege example

* Fix 'try' on encryption variables

* Fix roles

* Fix tests

* Use templatefile in output variables

* Remove FIXME

* Fix tests

* Merge branch 'jccb/gcs-to-bq-changes' of https://github.com/GoogleCloudPlatform/cloud-foundation-fabric into jccb/gcs-to-bq-changes

* fix readme and template

* fix readme

* Update FIXME.

Co-authored-by: Lorenzo Caggioni <lorenzo.caggioni@gmail.com>
Co-authored-by: Ludovico Magnocavallo <ludomagno@google.com>
Julio Castillo 2022-02-02 08:32:59 +01:00 committed by GitHub
parent 98b238ae7a
commit 5396735bc6
11 changed files with 81 additions and 76 deletions

View File

@ -3,17 +3,18 @@
This example creates the infrastructure needed to run a [Cloud Dataflow](https://cloud.google.com/dataflow) pipeline to import data from [GCS](https://cloud.google.com/storage) to [BigQuery](https://cloud.google.com/bigquery). The example creates different service accounts with least privilege on each resource. To run the pipeline, users listed in `data_eng_principals` can impersonate all of those service accounts.
The solution will use:
- internal IPs for GCE and Cloud Dataflow instances
- Cloud NAT to let resources egress to the Internet, to run system updates and install packages
- [Service Account Impersonation](https://cloud.google.com/iam/docs/impersonating-service-accounts) to avoid the use of service account keys (see the quick check below)
- Service Accounts with least privilege on each resource
- (Optional) CMEK encryption for GCS buckets, Dataflow instances and BigQuery tables
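Since all access in this example relies on Service Account Impersonation rather than exported keys, a quick way to confirm that your user can impersonate one of the example's service accounts is sketched below; the service account and project names are the same placeholders used in the commands later in this README, adjust them to your deployment:

```bash
# If this prints an access token, impersonation is correctly set up for your user.
gcloud auth print-access-token \
  --impersonate-service-account=orch-test@PROJECT.iam.gserviceaccount.com
```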
The example is designed to match real-world use cases with a minimum amount of resources and some compromises listed below. It can be used as a starting point for more complex scenarios.
This is the high-level diagram:
![GCS to BigQuery high-level diagram](diagram.png "GCS to BigQuery high-level diagram")
## Considerations for a real-world use case
In this example we made some compromises to keep it minimal and easy to read. For a real-world use case, you may want to evaluate the option to:
- Configure a Shared-VPC
@ -93,13 +94,13 @@ We need to create 3 files:
- A `person_udf.js` file containing the JavaScript UDF used by the Dataflow template.
- A `person_schema.json` file containing the table schema used to import the CSV.
You can find examples of those files in the `./data-demo` folder. You can copy the example files into the GCS bucket using the command returned in the Terraform output as `command_01_gcs`. Below is an example:
```bash
gsutil -i gcs-landing@PROJECT.iam.gserviceaccount.com cp data-demo/* gs://LANDING_BUCKET
```
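To confirm the files actually landed in the bucket, you can list it with the same impersonated identity (same placeholders as in the command above):

```bash
# List the objects just copied into the landing bucket.
gsutil -i gcs-landing@PROJECT.iam.gserviceaccount.com ls gs://LANDING_BUCKET
```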
We can now run the Dataflow pipeline using the `gcloud` command returned in the Terraform output as `command_02_dataflow`. Below is an example:
```bash
gcloud --impersonate-service-account=orch-test@PROJECT.iam.gserviceaccount.com dataflow jobs run test_batch_01 \
@ -119,7 +120,7 @@ outputTable=PROJECT:datalake.person,\
bigQueryLoadingTemporaryDirectory=gs://PREFIX-df-tmp
```
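Once the job has been launched, its progress can be followed with `gcloud` as well; for example (region and project are placeholders):

```bash
# List recent Dataflow jobs and their state in the region used by the example.
gcloud dataflow jobs list --region=REGION --project=PROJECT
```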
You can check the data imported into Google BigQuery using the command returned in the Terraform output as `command_03_bq`. Below is an example:
```bash
bq query --use_legacy_sql=false 'SELECT * FROM `PROJECT.datalake.person` LIMIT 1000'
@ -144,10 +145,10 @@ bq query --use_legacy_sql=false 'SELECT * FROM `PROJECT.datalake.person` LIMIT 1
|---|---|:---:|
| [bq_tables](outputs.tf#L15) | Bigquery Tables. | |
| [buckets](outputs.tf#L20) | GCS bucket Cloud KMS crypto keys. | |
| [command_01_gcs](outputs.tf#L43) | gcloud command to copy data into the created bucket impersonating the service account. | |
| [command_02_dataflow](outputs.tf#L48) | Command to run Dataflow template impersonating the service account. | |
| [command_03_bq](outputs.tf#L69) | BigQuery command to query imported data. | |
| [project_id](outputs.tf#L28) | Project id. | |
| [service_accounts](outputs.tf#L33) | Service account. | |
<!-- END TFDOC -->

View File

@ -0,0 +1,2 @@
bq query --project_id=${project_id} --use_legacy_sql=false \
'SELECT * FROM `${project_id}.${bigquery_dataset}.${bigquery_table}` LIMIT ${sql_limit}'

View File

@ -0,0 +1,18 @@
gcloud \
--impersonate-service-account=${sa_orch_email} \
dataflow jobs run test_batch_01 \
--gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
--project ${project_id} \
--region ${region} \
--disable-public-ips \
--subnetwork ${subnet} \
--staging-location ${gcs_df_stg} \
--service-account-email ${sa_df_email} \
%{ if cmek_encryption }--dataflow-kms-key=${kms_key_df} %{ endif } \
--parameters \
javascriptTextTransformFunctionName=transform,\
JSONPath=${data_schema_file},\
javascriptTextTransformGcsPath=${data_udf_file},\
inputFilePattern=${data_file},\
outputTable=${project_id}:${bigquery_dataset}.${bigquery_table},\
bigQueryLoadingTemporaryDirectory=${gcs_df_tmp}
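Both `.tftpl` files above are rendered through `templatefile()` in `outputs.tf` (see the outputs diff further down), so after `terraform apply` the fully expanded commands can be read straight from the outputs; a minimal sketch, assuming Terraform 0.14 or later for the `-raw` flag:

```bash
# Run from the example directory after `terraform apply`.
terraform output -raw command_02_dataflow
terraform output -raw command_03_bq
```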

View File

@ -12,10 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
module "gcs-data" {
source = "../../../modules/gcs"
project_id = module.project.project_id
@ -23,7 +19,7 @@ module "gcs-data" {
name = "data"
location = var.region
storage_class = "REGIONAL"
encryption_key = var.cmek_encryption ? module.kms[0].keys.key-gcs.id : null
force_destroy = true
}
@ -34,22 +30,20 @@ module "gcs-df-tmp" {
name = "df-tmp"
location = var.region
storage_class = "REGIONAL"
encryption_key = var.cmek_encryption ? module.kms[0].keys.key-gcs.id : null
force_destroy = true
}
module "bigquery-dataset" {
source = "../../../modules/bigquery-dataset"
project_id = module.project.project_id
id = "datalake"
location = var.region
# Note: we define tables in Terraform for the purpose of this
# example. A production environment would probably handle table
# creation in a separate terraform pipeline or using a different
# tool (for example: Dataform)
tables = {
person = {
friendly_name = "Person. Dataflow import."
@ -64,10 +58,10 @@ module "bigquery-dataset" {
deletion_protection = false
options = {
clustering = null
encryption_key = var.cmek_encryption ? module.kms[0].keys.key-bq.id : null
expiration_time = null
}
}
}
encryption_key = var.cmek_encryption ? module.kms[0].keys.key-bq.id : null
}
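When `cmek_encryption` is enabled, the default keys picked up by the resources above can be double-checked from the command line; a hedged sketch, where the bucket name, project id, dataset and table are placeholders matching the README examples:

```bash
# Default CMEK key configured on the data bucket.
gsutil kms encryption gs://PREFIX-data
# Encryption configuration (among other details) of the imported table.
bq show --format=prettyjson PROJECT:datalake.person
```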

View File

@ -17,7 +17,7 @@ module "kms" {
source = "../../../modules/kms"
project_id = module.project.project_id
keyring = {
name = "${var.prefix}-keyring"
location = var.region
}
keys = {

View File

@ -58,19 +58,18 @@ locals {
module.service-account-orch.iam_email,
]
"roles/iam.serviceAccountTokenCreator" = concat(
var.data_eng_principals,
)
"roles/viewer" = concat(
var.data_eng_principals
)
# Dataflow roles
"roles/dataflow.admin" = concat(
[module.service-account-orch.iam_email],
var.data_eng_principals
)
"roles/dataflow.worker" = [
module.service-account-df.iam_email,
]
"roles/dataflow.developer" = var.data_eng_principals
"roles/compute.viewer" = var.data_eng_principals
# network roles
"roles/compute.networkUser" = [
module.service-account-df.iam_email,
@ -79,10 +78,6 @@ locals {
}
}
module "project" {
source = "../../../modules/project"
name = var.project_id
@ -101,6 +96,7 @@ module "project" {
"storage.googleapis.com",
"storage-component.googleapis.com",
]
# additive IAM bindings avoid disrupting bindings in existing project
iam = var.project_create != null ? local.iam : {}
iam_additive = var.project_create == null ? local.iam : {}
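Since bindings are applied authoritatively on a newly created project and additively on an existing one, it can be useful to verify which roles a given member actually ended up with; a sketch using standard `gcloud` policy filtering, with `PROJECT` and `SA_EMAIL` as placeholders:

```bash
# Show all roles bound to a specific service account in the project IAM policy.
gcloud projects get-iam-policy PROJECT \
  --flatten="bindings[].members" \
  --filter="bindings.members:serviceAccount:SA_EMAIL" \
  --format="table(bindings.role)"
```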

View File

@ -30,7 +30,7 @@ output "project_id" {
value = module.project.project_id
}
output "service_accounts" {
description = "Service account."
value = {
bq = module.service-account-bq.email
@ -40,36 +40,38 @@ output "serviceaccount" {
}
}
output "command_01_gcs" {
description = "gcloud command to copy data into the created bucket impersonating the service account."
value = "gsutil -i ${module.service-account-landing.email} cp data-demo/* ${module.gcs-data.url}"
}
output "command_02_dataflow" {
description = "Command to run Dataflow template impersonating the service account."
value = templatefile("${path.module}/dataflow.tftpl", {
sa_orch_email = module.service-account-orch.email
project_id = module.project.project_id
region = var.region
subnet = module.vpc.subnets["${var.region}/subnet"].self_link
gcs_df_stg = format("%s/%s", module.gcs-df-tmp.url, "stg")
sa_df_email = module.service-account-df.email
cmek_encryption = var.cmek_encryption
kms_key_df = var.cmek_encryption ? module.kms[0].key_ids.key-df : null
gcs_data = module.gcs-data.url
data_schema_file = format("%s/%s", module.gcs-data.url, "person_schema.json")
data_udf_file = format("%s/%s", module.gcs-data.url, "person_udf.js")
data_file = format("%s/%s", module.gcs-data.url, "person.csv")
bigquery_dataset = module.bigquery-dataset.dataset_id
bigquery_table = module.bigquery-dataset.tables["person"].table_id
gcs_df_tmp = format("%s/%s", module.gcs-df-tmp.url, "tmp")
})
}
output "command_03_bq" {
description = "BigQuery command to query imported data."
value = templatefile("${path.module}/bigquery.tftpl", {
project_id = module.project.project_id
bigquery_dataset = module.bigquery-dataset.dataset_id
bigquery_table = module.bigquery-dataset.tables["person"].table_id
sql_limit = 1000
})
}

View File

@ -12,10 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
module "service-account-bq" {
source = "../../../modules/iam-service-account"
project_id = module.project.project_id

View File

@ -12,10 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
module "vpc" {
source = "../../../modules/net-vpc"
project_id = module.project.project_id

View File

@ -24,4 +24,4 @@ def test_resources(e2e_plan_runner):
"Test that plan works and the numbers of resources is as expected."
modules, resources = e2e_plan_runner(FIXTURES_DIR)
assert len(modules) == 11
assert len(resources) == 44
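The updated counts can be verified locally by running the example's pytest-based tests; a minimal sketch, run from the repository root with test selection by keyword since the exact test path may differ:

```bash
# Adjust the -k expression to match the example's test module name.
pytest -vv -k gcs_to_bq
```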