Changes to gcs to bq least privilege example (#447)
* Changes to gcs to bq least privilege example
* Fix 'try' on encryption variables
* Fix roles
* Fix tests
* Use templatefile in output variables
* Remove FIXME
* Fix readme and template
* Fix readme
* Update FIXME

Co-authored-by: Lorenzo Caggioni <lorenzo.caggioni@gmail.com>
Co-authored-by: Ludovico Magnocavallo <ludomagno@google.com>
This commit is contained in:
parent
98b238ae7a
commit
5396735bc6
@@ -3,17 +3,18 @@

This example creates the infrastructure needed to run a [Cloud Dataflow](https://cloud.google.com/dataflow) pipeline to import data from [GCS](https://cloud.google.com/storage) to [BigQuery](https://cloud.google.com/bigquery). The example will create different service accounts with least privileges on resources. To run the pipeline, users listed in `data_eng_principals` can impersonate all those service accounts.

The solution will use:

- internal IPs for GCE and Cloud Dataflow instances
- Cloud NAT to let resources egress to the Internet, to run system updates and install packages
- [Service Account Impersonation](https://cloud.google.com/iam/docs/impersonating-service-accounts) to avoid the use of service account keys
- Service Accounts with least privilege on each resource
- (Optional) CMEK encryption for the GCS bucket, Dataflow instances and BigQuery tables

-The example is designed to match real-world use cases with a minimum amount of resources and some compromise listed below. It can be used as a starting point for more complex scenarios.
+The example is designed to match real-world use cases with a minimum amount of resources and some compromises listed below. It can be used as a starting point for more complex scenarios.

This is the high-level diagram:

![GCS to BigQuery high-level diagram](diagram.png "GCS to BigQuery high-level diagram")

## Moving to a real use case

In the example we implemented some compromises to keep it minimal and easy to read. In a real-world use case, you may want to:

- Configure a Shared-VPC
@@ -93,13 +94,13 @@ We need to create 3 files:

- A `person_udf.js` containing the UDF javascript file used by the Dataflow template.
- A `person_schema.json` file containing the table schema used to import the CSV.

-You can find examples of those files in the folder `./data-demo`. You can copy the example files into the GCS bucket using the command returned in the terraform output as `command-01-gcs`. Below is an example:
+You can find examples of those files in the folder `./data-demo`. You can copy the example files into the GCS bucket using the command returned in the terraform output as `command_01_gcs`. Below is an example:

```bash
gsutil -i gcs-landing@PROJECT.iam.gserviceaccount.com cp data-demo/* gs://LANDING_BUCKET
```

-We can now run the Dataflow pipeline using the `gcloud` command returned in the terraform output as `command-02-dataflow`. Below is an example:
+We can now run the Dataflow pipeline using the `gcloud` command returned in the terraform output as `command_02_dataflow`. Below is an example:

```bash
gcloud --impersonate-service-account=orch-test@PROJECT.iam.gserviceaccount.com dataflow jobs run test_batch_01 \
@@ -119,7 +120,7 @@ outputTable=PROJECT:datalake.person,\
bigQueryLoadingTemporaryDirectory=gs://PREFIX-df-tmp
```

-You can check data imported into Google BigQuery using the command returned in the terraform output as `command-03-bq`. Below is an example:
+You can check data imported into Google BigQuery using the command returned in the terraform output as `command_03_bq`. Below is an example:

```
bq query --use_legacy_sql=false 'SELECT * FROM `PROJECT.datalake.person` LIMIT 1000'
@@ -144,10 +145,10 @@ bq query --use_legacy_sql=false 'SELECT * FROM `PROJECT.datalake.person` LIMIT 1
|---|---|:---:|
| [bq_tables](outputs.tf#L15) | Bigquery Tables. | |
| [buckets](outputs.tf#L20) | GCS bucket Cloud KMS crypto keys. | |
-| [command-01-gcs](outputs.tf#L43) | gcloud command to copy data into the created bucket impersonating the service account. | |
-| [command-02-dataflow](outputs.tf#L48) | Command to run Dataflow template impersonating the service account. | |
-| [command-03-bq](outputs.tf#L70) | BigQuery command to query imported data. | |
+| [command_01_gcs](outputs.tf#L43) | gcloud command to copy data into the created bucket impersonating the service account. | |
+| [command_02_dataflow](outputs.tf#L48) | Command to run Dataflow template impersonating the service account. | |
+| [command_03_bq](outputs.tf#L69) | BigQuery command to query imported data. | |
| [project_id](outputs.tf#L28) | Project id. | |
-| [serviceaccount](outputs.tf#L33) | Service account. | |
+| [service_accounts](outputs.tf#L33) | Service account. | |

<!-- END TFDOC -->
@@ -0,0 +1,2 @@
+bq query --project_id=${project_id} --use_legacy_sql=false \
+  'SELECT * FROM `${project_id}.${bigquery_dataset}.${bigquery_table}` LIMIT ${sql_limit}'
@@ -0,0 +1,18 @@
+gcloud \
+--impersonate-service-account=${sa_orch_email} \
+dataflow jobs run test_batch_01 \
+--gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
+--project ${project_id} \
+--region ${region} \
+--disable-public-ips \
+--subnetwork ${subnet} \
+--staging-location ${gcs_df_stg} \
+--service-account-email ${sa_df_email} \
+%{ if cmek_encryption }--dataflow-kms-key=${kms_key_df} %{ endif } \
+--parameters \
+javascriptTextTransformFunctionName=transform,\
+JSONPath=${data_schema_file},\
+javascriptTextTransformGcsPath=${data_udf_file},\
+inputFilePattern=${data_file},\
+outputTable=${project_id}:${bigquery_dataset}.${bigquery_table},\
+bigQueryLoadingTemporaryDirectory=${gcs_df_tmp}
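The `%{ if cmek_encryption }…%{ endif }` line in this new template is a standard Terraform template directive: the guarded text is rendered only when the `cmek_encryption` value passed to `templatefile()` is true. The same directives work in any HCL string template, so the behavior can be sketched inline; the values below are illustrative, not taken from the example:

```hcl
locals {
  cmek_encryption = true
  kms_key_df      = "projects/example-prj/locations/europe-west1/keyRings/kr/cryptoKeys/key-df"

  # Conditionally emit the KMS flag; with cmek_encryption = false this
  # renders as an empty string instead of a dangling flag.
  kms_flag = "%{if local.cmek_encryption}--dataflow-kms-key=${local.kms_key_df}%{endif}"
}

output "kms_flag" {
  value = local.kms_flag
}
```

The same evaluation happens at render time inside `templatefile()`, which is why the output variable can pass `kms_key_df = null` when encryption is disabled: the directive skips the interpolation entirely.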
@@ -12,10 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.

-###############################################################################
-# GCS                                                                         #
-###############################################################################
-
module "gcs-data" {
  source     = "../../../modules/gcs"
  project_id = module.project.project_id
@@ -23,7 +19,7 @@ module "gcs-data" {
  name           = "data"
  location       = var.region
  storage_class  = "REGIONAL"
-  encryption_key = var.cmek_encryption ? try(module.kms[0].keys.key-gcs.id, null) : null
+  encryption_key = var.cmek_encryption ? module.kms[0].keys.key-gcs.id : null
  force_destroy  = true
}
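The change above drops `try()` around the KMS key reference. Since the `kms` module in this example is instantiated with a `count` driven by the same `var.cmek_encryption` flag, the ternary condition already guarantees `module.kms[0]` exists whenever it is read; wrapping the reference in `try(..., null)` only masked genuine errors such as a misspelled key name. A minimal sketch of the pattern, with module paths and omitted arguments purely illustrative:

```hcl
variable "cmek_encryption" {
  description = "Use CMEK to encrypt storage resources."
  type        = bool
  default     = false
}

# The KMS keyring and keys exist only when CMEK is enabled.
module "kms" {
  source = "../../../modules/kms"
  count  = var.cmek_encryption ? 1 : 0
  # keyring/keys configuration omitted for brevity
}

module "gcs-data" {
  source = "../../../modules/gcs"
  # The condition guards the index: module.kms[0] is only read when the
  # instance exists, so a wrong key name now fails at plan time instead
  # of being silently converted to null by try().
  encryption_key = var.cmek_encryption ? module.kms[0].keys.key-gcs.id : null
  # remaining bucket configuration omitted for brevity
}
```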
@@ -34,22 +30,20 @@ module "gcs-df-tmp" {
  name           = "df-tmp"
  location       = var.region
  storage_class  = "REGIONAL"
-  encryption_key = var.cmek_encryption ? try(module.kms[0].keys.key-gcs.id, null) : null
+  encryption_key = var.cmek_encryption ? module.kms[0].keys.key-gcs.id : null
  force_destroy  = true
}

###############################################################################
# BQ                                                                          #
###############################################################################

module "bigquery-dataset" {
  source     = "../../../modules/bigquery-dataset"
  project_id = module.project.project_id
  id         = "datalake"
  location   = var.region
-  # Define Tables in Terraform for the porpuse of the example.
-  # Probably in a production environment you would handle Tables creation in a
-  # separate Terraform State or using a different tool/pipeline (for example: Dataform).
+  # Note: we define tables in Terraform for the purpose of this
+  # example. A production environment would probably handle table
+  # creation in a separate terraform pipeline or using a different
+  # tool (for example: Dataform)
  tables = {
    person = {
      friendly_name = "Person. Dataflow import."
@@ -64,10 +58,10 @@ module "bigquery-dataset" {
      deletion_protection = false
      options = {
        clustering      = null
-        encryption_key  = var.cmek_encryption ? try(module.kms[0].keys.key-bq.id, null) : null
+        encryption_key  = var.cmek_encryption ? module.kms[0].keys.key-bq.id : null
        expiration_time = null
      }
    }
  }
-  encryption_key = var.cmek_encryption ? try(module.kms[0].keys.key-bq.id, null) : null
+  encryption_key = var.cmek_encryption ? module.kms[0].keys.key-bq.id : null
}
@@ -17,7 +17,7 @@ module "kms" {
  source     = "../../../modules/kms"
  project_id = module.project.project_id
  keyring = {
-    name     = "${var.prefix}-keyring",
+    name     = "${var.prefix}-keyring"
    location = var.region
  }
  keys = {
@@ -58,19 +58,18 @@ locals {
      module.service-account-orch.iam_email,
    ]
    "roles/iam.serviceAccountTokenCreator" = concat(
      var.data_eng_principals,
    )
    "roles/viewer" = concat(
      var.data_eng_principals
    )
    # Dataflow roles
-    "roles/dataflow.admin" = concat([
-      module.service-account-orch.iam_email,
-    ], var.data_eng_principals
+    "roles/dataflow.admin" = concat(
+      [module.service-account-orch.iam_email],
+      var.data_eng_principals
+    )
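The corrected binding uses `concat()` to merge a single-element list with the user-supplied principals; the earlier form wrapped its arguments inconsistently. `concat()` joins its list arguments in order, which is why the service account is wrapped in a list literal. A minimal sketch of the merge, with illustrative identities:

```hcl
locals {
  data_eng_principals = ["user:data-eng@example.com", "group:data@example.com"]

  # concat() takes lists, so the single service account is wrapped in a
  # list literal before being merged with the principals; the result is
  # a flat three-element list usable as IAM binding members.
  dataflow_admins = concat(
    ["serviceAccount:orchestrator@example-prj.iam.gserviceaccount.com"],
    local.data_eng_principals
  )
}
```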
    "roles/dataflow.worker" = [
      module.service-account-df.iam_email,
    ]
    "roles/dataflow.developer" = var.data_eng_principals
    "roles/compute.viewer"     = var.data_eng_principals
    # network roles
    "roles/compute.networkUser" = [
      module.service-account-df.iam_email,
@@ -79,10 +78,6 @@ locals {
  }
}

-###############################################################################
-# Projects                                                                    #
-###############################################################################
-
module "project" {
  source = "../../../modules/project"
  name   = var.project_id
@@ -101,6 +96,7 @@ module "project" {
    "storage.googleapis.com",
    "storage-component.googleapis.com",
  ]
+  # additive IAM bindings avoid disrupting bindings in existing project
  iam          = var.project_create != null ? local.iam : {}
  iam_additive = var.project_create == null ? local.iam : {}
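The comment here documents a real trade-off: `iam` sets authoritative bindings, overwriting whatever members a role already has, while `iam_additive` only adds members. Keying the choice on `var.project_create` keeps the example from clobbering bindings in a pre-existing project. A minimal sketch of the same guard, with the variable shape and binding map purely illustrative:

```hcl
variable "project_create" {
  description = "Set to create a new project; leave null to reuse an existing one."
  type = object({
    billing_account_id = string
    parent             = string
  })
  default = null
}

locals {
  # Illustrative binding map; the example builds this from its service accounts.
  iam = {
    "roles/viewer" = ["user:data-eng@example.com"]
  }
}

module "project" {
  source = "../../../modules/project"
  name   = "example-prj"
  # Authoritative bindings are safe only on a project this configuration
  # created; on an existing project, additive bindings avoid removing
  # members granted elsewhere.
  iam          = var.project_create != null ? local.iam : {}
  iam_additive = var.project_create == null ? local.iam : {}
}
```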
@@ -30,7 +30,7 @@ output "project_id" {
  value = module.project.project_id
}

-output "serviceaccount" {
+output "service_accounts" {
  description = "Service account."
  value = {
    bq = module.service-account-bq.email
@@ -40,36 +40,38 @@ output "serviceaccount" {
  }
}

-output "command-01-gcs" {
+output "command_01_gcs" {
  description = "gcloud command to copy data into the created bucket impersonating the service account."
  value       = "gsutil -i ${module.service-account-landing.email} cp data-demo/* ${module.gcs-data.url}"
}

-output "command-02-dataflow" {
+output "command_02_dataflow" {
  description = "Command to run Dataflow template impersonating the service account."
-  value       = <<EOT
-    gcloud --impersonate-service-account=${module.service-account-orch.email} dataflow jobs run test_batch_01 \
-    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
-    --project ${module.project.project_id} \
-    --region ${var.region} \
-    --disable-public-ips \
-    --subnetwork ${module.vpc.subnets[format("%s/%s", var.region, "subnet")].self_link} \
-    --staging-location ${module.gcs-df-tmp.url} \
-    --service-account-email ${module.service-account-df.email} \
-    ${var.cmek_encryption ? format("--dataflow-kms-key=%s", module.kms[0].key_ids.key-df) : ""} \
-    --parameters \
-    javascriptTextTransformFunctionName=transform,\
-    JSONPath=${module.gcs-data.url}/person_schema.json,\
-    javascriptTextTransformGcsPath=${module.gcs-data.url}/person_udf.js,\
-    inputFilePattern=${module.gcs-data.url}/person.csv,\
-    outputTable=${module.project.project_id}:${module.bigquery-dataset.dataset_id}.${module.bigquery-dataset.tables["person"].table_id},\
-    bigQueryLoadingTemporaryDirectory=${module.gcs-df-tmp.url}
-  EOT
+  value = templatefile("${path.module}/dataflow.tftpl", {
+    sa_orch_email    = module.service-account-orch.email
+    project_id       = module.project.project_id
+    region           = var.region
+    subnet           = module.vpc.subnets["${var.region}/subnet"].self_link
+    gcs_df_stg       = format("%s/%s", module.gcs-df-tmp.url, "stg")
+    sa_df_email      = module.service-account-df.email
+    cmek_encryption  = var.cmek_encryption
+    kms_key_df       = var.cmek_encryption ? module.kms[0].key_ids.key-df : null
+    gcs_data         = module.gcs-data.url
+    data_schema_file = format("%s/%s", module.gcs-data.url, "person_schema.json")
+    data_udf_file    = format("%s/%s", module.gcs-data.url, "person_udf.js")
+    data_file        = format("%s/%s", module.gcs-data.url, "person.csv")
+    bigquery_dataset = module.bigquery-dataset.dataset_id
+    bigquery_table   = module.bigquery-dataset.tables["person"].table_id
+    gcs_df_tmp       = format("%s/%s", module.gcs-df-tmp.url, "tmp")
+  })
}

-output "command-03-bq" {
+output "command_03_bq" {
  description = "BigQuery command to query imported data."
-  value       = <<EOT
-    bq query --project_id=${module.project.project_id} --use_legacy_sql=false 'SELECT * FROM `${module.project.project_id}.${module.bigquery-dataset.dataset_id}.${module.bigquery-dataset.tables["person"].table_id}` LIMIT 1000'"
-  EOT
+  value = templatefile("${path.module}/bigquery.tftpl", {
+    project_id       = module.project.project_id
+    bigquery_dataset = module.bigquery-dataset.dataset_id
+    bigquery_table   = module.bigquery-dataset.tables["person"].table_id
+    sql_limit        = 1000
+  })
}
@@ -12,10 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.

-###############################################################################
-# Service Accounts                                                            #
-###############################################################################
-
module "service-account-bq" {
  source     = "../../../modules/iam-service-account"
  project_id = module.project.project_id
@@ -1,3 +1,3 @@
data_eng_principals = ["user:data-eng@domain.com"]
project_id          = "datalake-001"
prefix              = "prefix"
@@ -12,10 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.

-###############################################################################
-# Networking                                                                  #
-###############################################################################
-
module "vpc" {
  source     = "../../../modules/net-vpc"
  project_id = module.project.project_id
@@ -24,4 +24,4 @@ def test_resources(e2e_plan_runner):
    "Test that plan works and the numbers of resources is as expected."
    modules, resources = e2e_plan_runner(FIXTURES_DIR)
    assert len(modules) == 11
-    assert len(resources) == 43
+    assert len(resources) == 44