Add gcs2bq with least privileges example

This commit is contained in:
lcaggio 2021-12-24 11:21:42 +01:00
parent bf1e2e3bed
commit 44542c8b17
15 changed files with 716 additions and 0 deletions

View File

@ -0,0 +1,113 @@
# Cloud Storage to BigQuery with Cloud Dataflow using least privileges
This example creates the infrastructure needed to run a [Cloud Dataflow](https://cloud.google.com/dataflow) pipeline that imports data from [GCS](https://cloud.google.com/storage) into [BigQuery](https://cloud.google.com/bigquery). The example creates distinct Service Accounts with least privileges on resources. To run the pipeline, users listed in `data_eng_users` or `data_eng_groups` can impersonate all those Service Accounts.
The solution will use:
- internal IPs for GCE and Dataflow instances
- Cloud NAT to let resources communicate with the Internet, run system updates, and install packages
- Google Service Account impersonation for a better separation of roles
- Service Accounts with least privileges on each resource
The example is designed to match real-world use cases with a minimum amount of resources. It can be used as a starting point for more complex scenarios.
This is the high level diagram:
![GCS to BigQuery High-level diagram](diagram.png "GCS to BigQuery High-level diagram")
## Managed resources and services
This sample creates several distinct groups of resources:
- projects
  - Service project configured for GCS buckets, Dataflow instances, BigQuery tables, and orchestration
- networking
  - VPC network
  - One subnet
  - Firewall rules for [SSH access via IAP](https://cloud.google.com/iap/docs/using-tcp-forwarding) and open communication within the VPC
- IAM
  - One service account for uploading data into the GCS landing bucket
  - One service account for orchestration
  - One service account for Dataflow instances
  - One service account for BigQuery tables
- GCS
  - One bucket
- BQ
  - One dataset
In this example you can also configure users or groups of users to be granted the viewer role on the created resources and the ability to impersonate service accounts, so they can test Dataflow pipelines before automating them with Composer or any other orchestration system.
## Deploy your environment
Run Terraform init:
```
$ terraform init
```
Configure the Terraform variables in your `terraform.tfvars` file. You need to specify at least the following variables:
```
billing_account = "0011322-334455-667788"
root_node = "folders/123456789012"
project_name = "test-demo-tf-001"
data_eng_users = ["your_email@domani.example"]
```
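If you don't have these values at hand, one possible way to look them up, assuming your account has the required list permissions (`ORGANIZATION_ID` is a placeholder):
```bash
# list billing accounts visible to your account
gcloud billing accounts list
# list folders under your organization to find the root_node id
gcloud resource-manager folders list --organization=ORGANIZATION_ID
```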
You can now run:
```
$ terraform apply
```
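If you prefer to review the changes before applying them, the standard Terraform plan-then-apply flow works here as well:
```bash
terraform plan -out=plan.out
terraform apply plan.out
```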
You should see the output of the Terraform script with the resources created and some pre-created commands to run in the example steps below.
## Test your environment with Cloud Dataflow
We assume these steps are run by a user listed in `data_eng_users` or `data_eng_groups`. You can authenticate as that user with the following command:
```
$ gcloud init
```
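To verify that your user can impersonate the service accounts, you can try minting a short-lived token. This is just a quick check, with `PROJECT` a placeholder for your project id:
```bash
gcloud auth print-access-token \
  --impersonate-service-account=orchestrator@PROJECT.iam.gserviceaccount.com
```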
For the purpose of the example we will import a CSV file from GCS into BigQuery with the following structure:
```
name,surname,timestamp
```
We need to create three files:
- A `person.csv` file containing your data in the form `name,surname,timestamp`. Here is an example line: `Lorenzo,Caggioni,1637771951`.
- A `person_udf.js` file containing the JavaScript UDF used by the Dataflow template.
- A `person_schema.json` file containing the table schema used to import the CSV.
You can find an example of those files in the folder `./data-demo`. You can copy the example files into the GCS bucket using the command returned in the Terraform output as `command-01-gcs`.
```bash
gsutil -i gcs-landing@PROJECT.iam.gserviceaccount.com cp data-demo/* gs://LANDING_BUCKET
```
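As a quick check, you can list the uploaded objects with the same impersonated identity:
```bash
gsutil -i gcs-landing@PROJECT.iam.gserviceaccount.com ls gs://LANDING_BUCKET
```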
We can now run the Dataflow pipeline using the `gcloud` command returned in the Terraform output as `command-02-dataflow`.
```bash
gcloud --impersonate-service-account=orchestrator@PROJECT.iam.gserviceaccount.com dataflow jobs run test_batch_01 \
--gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
--project PROJECT \
--region REGION \
--disable-public-ips \
--subnetwork https://www.googleapis.com/compute/v1/projects/PROJECT/regions/REGION/subnetworks/subnet \
--staging-location gs://PROJECT-df-tmplocation \
--service-account-email df-loading@PROJECT.iam.gserviceaccount.com \
--parameters \
javascriptTextTransformFunctionName=transform,\
JSONPath=gs://PROJECT-data-landing/person_schema.json,\
javascriptTextTransformGcsPath=gs://PROJECT-data-landing/person_udf.js,\
inputFilePattern=gs://PROJECT-data-landing/person.csv,\
outputTable=PROJECT:datalake.person,\
bigQueryLoadingTemporaryDirectory=gs://PROJECT-df-tmplocation
```
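While the job runs, you can follow its progress with `gcloud`; a monitoring sketch using the same `PROJECT`/`REGION` placeholders:
```bash
gcloud dataflow jobs list --project PROJECT --region REGION --status active
```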
You can check the data imported into BigQuery using the command returned in the Terraform output as `command-03-bq`:
```
bq query --use_legacy_sql=false 'SELECT * FROM `PROJECT.datalake.person` LIMIT 1000'
```
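A quick row-count check along the same lines:
```bash
bq query --use_legacy_sql=false 'SELECT COUNT(*) AS row_count FROM `PROJECT.datalake.person`'
```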

View File

@ -0,0 +1,30 @@
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# The `impersonate_service_account` option requires the identity running Terraform
# to have the `roles/iam.serviceAccountTokenCreator` role on the specified Service Account.
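# A possible way to grant that role (hypothetical user email shown):
#   gcloud iam service-accounts add-iam-policy-binding \
#     SERVICE_ACCOUNT@PROJECT_ID.iam.gserviceaccount.com \
#     --member="user:you@example.com" \
#     --role="roles/iam.serviceAccountTokenCreator"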
terraform {
backend "gcs" {
bucket = "BUCKET_NAME"
prefix = "PREFIX"
impersonate_service_account = "SERVICE_ACCOUNT@PROJECT_ID.iam.gserviceaccount.com"
}
}
provider "google" {
impersonate_service_account = "SERVICE_ACCOUNT@PROJECT_ID.iam.gserviceaccount.com"
}
provider "google-beta" {
impersonate_service_account = "SERVICE_ACCOUNT@PROJECT_ID.iam.gserviceaccount.com"
}

View File

@ -0,0 +1,11 @@
Lorenzo,Caggioni,1637771951
Lorenzo,Caggioni,1637771952
Lorenzo,Caggioni,1637771953
Lorenzo,Caggioni,1637771954
Lorenzo,Caggioni,1637771955
Lorenzo,Caggioni,1637771956
Lorenzo,Caggioni,1637771957
Lorenzo,Caggioni,1637771958
Lorenzo,Caggioni,1637771959
Lorenzo,Caggioni,1637771910
Lorenzo,Caggioni,1637771911

View File

@ -0,0 +1,16 @@
{
"BigQuery Schema": [
{
"name": "name",
"type": "STRING"
},
{
"name": "surname",
"type": "STRING"
},
{
"name": "timestamp",
"type": "TIMESTAMP"
}
]
}

View File

@ -0,0 +1,11 @@
/**
 * UDF used by the GCS_Text_to_BigQuery Dataflow template: maps one CSV
 * line to a JSON record matching the BigQuery table schema, e.g.
 * "Lorenzo,Caggioni,1637771951" -> {"name":"Lorenzo",...}.
 */
function transform(line) {
  var values = line.split(',');
  var obj = {
    name: values[0],
    surname: values[1],
    timestamp: values[2]
  };
  return JSON.stringify(obj);
}

Binary file not shown.


View File

@ -0,0 +1,245 @@
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
locals {
data_eng_users_iam = [
for item in var.data_eng_users :
"user:${item}"
]
data_eng_groups_iam = [
for item in var.data_eng_groups :
"group:${item}"
]
}
###############################################################################
# Projects - Centralized #
###############################################################################
module "project-service" {
source = "../../modules/project"
name = var.project_name
parent = var.root_node
billing_account = var.billing_account
services = [
"compute.googleapis.com",
"servicenetworking.googleapis.com",
"storage-component.googleapis.com",
"bigquery.googleapis.com",
"bigquerystorage.googleapis.com",
"bigqueryreservation.googleapis.com",
"dataflow.googleapis.com",
]
iam = {
# GCS roles
"roles/storage.objectAdmin" = [
module.service-account-df.iam_email,
module.service-account-landing.iam_email
],
"roles/storage.objectViewer" = [
module.service-account-orch.iam_email,
],
# BigQuery roles
"roles/bigquery.admin" = [
module.service-account-orch.iam_email,
]
"roles/bigquery.dataEditor" = [
module.service-account-df.iam_email,
]
"roles/bigquery.dataViewer" = [
module.service-account-bq.iam_email,
module.service-account-orch.iam_email
]
"roles/bigquery.jobUser" = [
module.service-account-df.iam_email
]
"roles/bigquery.user" = [
module.service-account-bq.iam_email,
module.service-account-df.iam_email
]
# Common roles
"roles/logging.logWriter" = [
module.service-account-bq.iam_email,
module.service-account-landing.iam_email,
module.service-account-orch.iam_email,
]
"roles/monitoring.metricWriter" = [
module.service-account-bq.iam_email,
module.service-account-landing.iam_email,
module.service-account-orch.iam_email,
]
"roles/iam.serviceAccountUser" = [
module.service-account-orch.iam_email,
]
"roles/iam.serviceAccountTokenCreator" = concat(
local.data_eng_users_iam,
local.data_eng_groups_iam
)
"roles/viewer" = concat(
local.data_eng_users_iam,
local.data_eng_groups_iam
)
# Dataflow roles
"roles/dataflow.admin" = [
module.service-account-orch.iam_email,
]
}
oslogin = true
}
###############################################################################
# Project Service Accounts #
###############################################################################
module "service-account-bq" {
source = "../../modules/iam-service-account"
project_id = module.project-service.project_id
name = "bq-datalake"
iam = {
"roles/iam.serviceAccountTokenCreator" = concat(
local.data_eng_users_iam,
local.data_eng_groups_iam
)
}
}
module "service-account-landing" {
source = "../../modules/iam-service-account"
project_id = module.project-service.project_id
name = "gcs-landing"
iam = {
"roles/iam.serviceAccountTokenCreator" = concat(
local.data_eng_users_iam,
local.data_eng_groups_iam
)
}
}
module "service-account-orch" {
source = "../../modules/iam-service-account"
project_id = module.project-service.project_id
name = "orchestrator"
iam = {
"roles/iam.serviceAccountTokenCreator" = concat(
local.data_eng_users_iam,
local.data_eng_groups_iam
)
}
}
module "service-account-df" {
source = "../../modules/iam-service-account"
project_id = module.project-service.project_id
name = "df-loading"
iam_project_roles = {
(var.project_name) = [
"roles/dataflow.worker",
"roles/bigquery.dataOwner",
"roles/bigquery.metadataViewer",
"roles/storage.objectViewer",
"roles/bigquery.jobUser"
]
}
iam = {
"roles/iam.serviceAccountTokenCreator" = concat(
local.data_eng_users_iam,
local.data_eng_groups_iam
),
"roles/iam.serviceAccountUser" = concat(
[module.service-account-orch.iam_email],
local.data_eng_users_iam,
local.data_eng_groups_iam
)
}
}
###############################################################################
# Networking #
###############################################################################
module "vpc" {
source = "../../modules/net-vpc"
project_id = module.project-service.project_id
name = var.vpc_name
subnets = [
{
ip_cidr_range = var.vpc_ip_cidr_range
name = var.vpc_subnet_name
region = var.region
secondary_ip_range = {}
}
]
}
module "vpc-firewall" {
source = "../../modules/net-vpc-firewall"
project_id = module.project-service.project_id
network = module.vpc.name
admin_ranges = [var.vpc_ip_cidr_range]
}
module "nat" {
source = "../../modules/net-cloudnat"
project_id = module.project-service.project_id
region = var.region
name = "default"
router_network = module.vpc.name
}
###############################################################################
# GCS #
###############################################################################
module "gcs-01" {
source = "../../modules/gcs"
for_each = toset(["data-landing", "df-tmplocation"])
project_id = module.project-service.project_id
prefix = module.project-service.project_id
name = each.key
force_destroy = true
}
# module "gcs-02" {
# source = "../../modules/gcs-demo"
# project_id = module.project-service.project_id
# prefix = module.project-service.project_id
# name = "test-region"
# location = "europe-west1"
# storage_class = "REGIONAL"
# force_destroy = true
# }
###############################################################################
# BQ #
###############################################################################
module "bigquery-dataset" {
source = "../../modules/bigquery-dataset"
project_id = module.project-service.project_id
id = "datalake"
tables = {
person = {
friendly_name = "Person. Dataflow import."
labels = {}
options = null
partitioning = {
field = null
range = null # use start/end/interval for range
time = null
}
schema = file("${path.module}/person.json")
deletion_protection = false
}
}
}

View File

@ -0,0 +1,73 @@
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
output "bq_tables" {
description = "Bigquery Tables."
value = module.bigquery-dataset.table_ids
}
output "buckets" {
description = "GCS Bucket Cloud KMS crypto keys."
value = {
for name, bucket in module.gcs-01 :
bucket.name => bucket.url
}
}
output "projects" {
description = "Project ids."
value = {
service-project = module.project-service.project_id
}
}
output "serviceaccount" {
description = "Service Account."
value = {
bq = module.service-account-bq.email
df = module.service-account-df.email
orch = module.service-account-orch.email
}
}
output "command-01-gcs" {
description = "gcloud command to copy data into the created bucket impersonating the service account."
value = "gsutil -i ${module.service-account-landing.email} cp data-demo/* ${module.gcs-01["data-landing"].url}"
}
output "command-02-dataflow" {
description = "gcloud command to run dataflow template impersonating the service account."
value = <<EOT
gcloud --impersonate-service-account=${module.service-account-orch.email} dataflow jobs run test_batch_01 \
--gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
--project ${module.project-service.project_id} \
--region ${var.region} \
--disable-public-ips \
--subnetwork ${module.vpc.subnets[format("%s/%s", var.region, var.vpc_subnet_name)].self_link} \
--staging-location ${module.gcs-01["df-tmplocation"].url} \
--service-account-email ${module.service-account-df.email} \
--parameters \
javascriptTextTransformFunctionName=transform,\
JSONPath=${module.gcs-01["data-landing"].url}/person_schema.json,\
javascriptTextTransformGcsPath=${module.gcs-01["data-landing"].url}/person_udf.js,\
inputFilePattern=${module.gcs-01["data-landing"].url}/person.csv,\
outputTable=${module.project-service.project_id}:${module.bigquery-dataset.dataset_id}.${module.bigquery-dataset.tables["person"].table_id},\
bigQueryLoadingTemporaryDirectory=${module.gcs-01["df-tmplocation"].url}
EOT
}
output "command-03-bq" {
description = "bq command to query imported data."
value = "bq query --use_legacy_sql=false 'SELECT * FROM `${module.project-service.project_id}.${module.bigquery-dataset.dataset_id}.${module.bigquery-dataset.tables["person"].table_id}` LIMIT 1000'"
}

View File

@ -0,0 +1,17 @@
[
{
"mode": "NULLABLE",
"name": "name",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "surname",
"type": "STRING"
},
{
"mode": "NULLABLE",
"name": "timestamp",
"type": "TIMESTAMP"
}
]

View File

@ -0,0 +1,77 @@
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
variable "billing_account" {
description = "Billing account id used as default for new projects."
type = string
}
variable "location" {
description = "The location where resources will be deployed."
type = string
default = "europe"
}
variable "project_name" {
description = "Name for the new Service Project."
type = string
}
variable "region" {
description = "The region where resources will be deployed."
type = string
default = "europe-west1"
}
variable "root_node" {
description = "The resource name of the parent Folder or Organization. Must be of the form folders/folder_id or organizations/org_id."
type = string
}
variable "ssh_source_ranges" {
description = "IP CIDR ranges that will be allowed to connect via SSH to the onprem instance."
type = list(string)
default = ["0.0.0.0/0"]
}
variable "data_eng_groups" {
description = "Groups with Service Account Tocken creator role on service accounts in the form 'USER/GROUP_EMAIL'."
type = list(string)
default = []
}
variable "data_eng_users" {
description = "Users with Service Account Tocken creator role on service accounts in the form 'USER/GROUP_EMAIL'."
type = list(string)
default = []
}
variable "vpc_ip_cidr_range" {
description = "Ip range used in the subnet deployef in the Service Project."
type = string
default = "10.0.0.0/20"
}
variable "vpc_name" {
description = "Name of the VPC created in the Service Project."
type = string
default = "local"
}
variable "vpc_subnet_name" {
description = "Name of the subnet created in the Service Project."
type = string
default = "subnet"
}

View File

@ -0,0 +1,29 @@
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
terraform {
required_version = ">= 1.0.0"
required_providers {
google = {
source = "hashicorp/google"
version = ">= 4.0.0"
}
google-beta = {
source = "hashicorp/google-beta"
version = ">= 4.0.0"
}
}
}

View File

@ -0,0 +1,13 @@
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

View File

@ -0,0 +1,22 @@
/**
* Copyright 2021 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
module "test" {
source = "../../../../data-solutions/gcs-to-bq-with-least-privileges/"
billing_account = var.billing_account
root_node = var.root_node
project_name = var.project_name
}

View File

@ -0,0 +1,32 @@
/**
* Copyright 2021 Google LLC
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
variable "billing_account" {
type = string
default = "123456-123456-123456"
}
variable "project_name" {
description = "The project name."
type = string
default = "gcs2bq-least-privileges"
}
variable "root_node" {
description = "The resource name of the parent Folder or Organization. Must be of the form folders/folder_id or organizations/org_id."
type = string
default = "folders/12345678"
}

View File

@ -0,0 +1,27 @@
# Copyright 2021 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import pytest
FIXTURES_DIR = os.path.join(os.path.dirname(__file__), 'fixture')
def test_resources(e2e_plan_runner):
"Test that plan works and the numbers of resources is as expected."
modules, resources = e2e_plan_runner(FIXTURES_DIR)
assert len(modules) == 11
assert len(resources) == 49