
Minimal Data Platform

This module implements a minimal opinionated Data Platform Architecture based on Dataproc Serverless resources. It creates and sets up projects and related resources that compose an end-to-end data environment.

This minimal Data Platform Architecture keeps the solution to a minimal set of projects. This approach makes the architecture easy to read and operate, but limits its ability to scale to multiple workloads. For more complex use cases, where workloads need processing role segmentation between transformations or deeper cost attribution, refer to the Data Platform blueprint.

The code is intentionally simple, as it's intended to provide a generic initial setup and then allow easy customizations to complete the implementation of the intended design.

The following diagram is a high-level reference of the resources created and managed here:

Data Platform architecture overview

A set of demo Airflow pipelines is also part of this blueprint: they can be run on top of the foundational infrastructure to verify and test the setup.

Design overview and choices

Despite its simplicity, this stage implements the basics of a design that we've seen working well for various customers.

The approach adapts to different high-level requirements:

  • boundaries for each step
  • clearly defined actors
  • least privilege principle
  • reliance on service account impersonation

The code in this blueprint doesn't address Organization-level configurations (Organization policy, VPC-SC, centralized logs). We expect those elements to be managed by automation stages external to this script, such as those in FAST, with this blueprint deployed on top of them as one of the stages.

Project structure

The Data Platform is designed to rely on several projects, one project per data stage. The stages identified are:

  • landing
  • processing
  • curated
  • common

This separation into projects allows adhering to the least-privilege principle by using project-level roles.

The script will create the following projects:

  • Landing Data, stored in relevant formats. Structured data can be stored in BigQuery or in GCS using an appropriate file format such as AVRO or Parquet. Unstructured data is stored on Cloud Storage.
  • Processing Used to host all resources needed to process and orchestrate data movement. Cloud Composer orchestrates all tasks that move data across layers. Cloud Dataproc Serverless processes and moves data between layers. Anonymization or tokenization of Personally Identifiable Information (PII) can be implemented here using Cloud DLP or a custom solution, depending on your requirements.
  • Curated Cleansed, aggregated and curated data.
  • Common Common services such as Cloud DLP or Data Catalog.

Roles

We assign roles on resources at the project level, granting the appropriate roles via groups (humans) and service accounts (services and applications) according to best practices.

Service accounts

Service account creation follows the least privilege principle, with each service account performing a single task that requires access to a defined set of resources. The table below shows a high-level overview of roles for each service account on each data layer, using READ or WRITE access patterns for simplicity.

A full reference of the IAM roles managed by the Data Platform is available in IAM.md.

For detailed roles please refer to the code.

Using service account keys within a data pipeline exposes you to several security risks deriving from a credentials leak. This blueprint shows how to leverage impersonation to avoid the need to create keys.
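
As a minimal sketch of the impersonation building block (the service account and group below are hypothetical, not resources defined by this blueprint), impersonation is enabled by granting roles/iam.serviceAccountTokenCreator on the target service account instead of exporting a key:

resource "google_service_account_iam_member" "processing_sa_impersonation" {
  # Allows members of the group to generate short-lived credentials for the
  # target service account, removing the need for exported keys.
  service_account_id = "projects/PROJECT_ID/serviceAccounts/processing-sa@PROJECT_ID.iam.gserviceaccount.com"
  role               = "roles/iam.serviceAccountTokenCreator"
  member             = "group:gcp-data-engineers@example.com"
}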

User groups

User groups provide a stable frame of reference that allows decoupling the final set of permissions from the stage where entities and resources are created, and their IAM bindings defined.

We use three groups to control access to resources:

  • Data Engineers. They handle and run the Data Hub, with read access to all resources in order to troubleshoot possible issues with pipelines. This team can also impersonate any service account.
  • Data Analysts. They perform analysis on datasets, with read access to the Data Warehouse Confidential project, and BigQuery READ/WRITE access to the playground project.
  • Data Security. They handle security configurations related to the Data Hub. This team has admin access to the common project to configure Cloud DLP templates or Data Catalog policy tags.

Virtual Private Cloud (VPC) design

As is often the case in real-world configurations, this blueprint accepts an existing Shared VPC as input via the network_config variable. Make sure that the GKE API (container.googleapis.com) is enabled in the VPC host project. Remember also to configure the firewall rules needed for the different products you are going to use: Composer, Dataflow or Dataproc.
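
As a sketch, assuming a hypothetical host project id, the GKE API could be enabled by network automation outside this blueprint with something like:

resource "google_project_service" "gke_api" {
  # Enables the GKE API required by Cloud Composer on the Shared VPC host project.
  project = "HOST_PROJECT_ID"
  service = "container.googleapis.com"
}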

If the network_config variable is not provided, a VPC will be created in the processing project, which is the only project hosting network resources.

IP ranges and subnetting

To deploy this blueprint with self-managed VPCs you need the following ranges:

  • one /24 for the processing project VPC subnet used for Cloud Dataproc workers
  • one /24 range for the orchestration VPC subnet used for Composer workers
  • one /22 and one /24 ranges for the secondary ranges associated with the orchestration VPC subnet

If you are using Shared VPC, you need one subnet with one /22 and one /24 secondary range defined for Composer pods and services.

In both VPC scenarios, you also need these ranges for Composer:

  • one /24 for Cloud SQL
  • one /28 for the GKE control plane
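
For reference, the sketch below shows a Shared VPC subnet for Composer that satisfies the ranges listed above; the project, network, names and CIDRs are placeholders, and the subnet is expected to be managed outside this blueprint:

resource "google_compute_subnetwork" "composer" {
  project                  = "HOST_PROJECT_ID"
  name                     = "composer-subnet"
  region                   = "europe-west1"
  network                  = "VPC_SELF_LINK"
  ip_cidr_range            = "10.0.0.0/24" # Composer workers
  private_ip_google_access = true
  secondary_ip_range {
    range_name    = "pods"
    ip_cidr_range = "10.4.0.0/22"
  }
  secondary_ip_range {
    range_name    = "services"
    ip_cidr_range = "10.0.1.0/24"
  }
}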

Resource naming conventions

Resources follow the naming convention described below.

  • prefix-layer for projects
  • prefix-layer-product for resources
  • prefix-layer[2]-gcp-product[2]-counter for services and service accounts

Encryption

We suggest a centralized approach to key management, where Organization Security is the only team that can access encryption material, and keyrings and keys are managed in a project external to the Data Platform.

Centralized Cloud Key Management high-level diagram

To configure the use of Cloud KMS on resources, you have to specify the key id in the service_encryption_keys variable. Key locations should match resource locations. Example:

service_encryption_keys = {
    bq       = "KEY_URL"
    composer = "KEY_URL"
    compute  = "KEY_URL"
    storage  = "KEY_URL"
}
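
Each KEY_URL above is the fully qualified Cloud KMS key id, typically in the form projects/PROJECT_ID/locations/LOCATION/keyRings/KEYRING_NAME/cryptoKeys/KEY_NAME, where the location segment matches the region or multi-region of the resources being encrypted.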

This step is optional and depends on customer policies and security best practices.

Data Anonymization

We suggest using Cloud Data Loss Prevention to identify/mask/tokenize your confidential data.

While implementing a Data Loss Prevention strategy is out of scope for this blueprint, we enable the service in two different projects so that Cloud Data Loss Prevention templates can be configured in either of them.

In the centralized approach, Cloud Data Loss Prevention resources and templates are stored in the Common project:

Centralized Cloud Data Loss Prevention high-level diagram

You can find more details and best practices on using DLP for de-identification and re-identification of PII in large-scale datasets in the GCP documentation.

Data Catalog

Data Catalog helps you document your data entries at scale. Data Catalog relies on tags and tag templates to manage metadata for all data entries in a unified and centralized service. To implement column-level security on BigQuery, we suggest using Tags and Tag templates.

The default configuration will implement 3 tags:

  • 3_Confidential: policy tag for columns that include very sensitive information, such as credit card numbers.
  • 2_Private: policy tag for columns that include sensitive personal identifiable information (PII) information, such as a person's first name.
  • 1_Sensitive: policy tag for columns that include data that cannot be made public, such as the credit limit.

Anything that is not tagged is available to all users who have access to the data warehouse.

For the purposes of the blueprint, no group has access to tagged data. You can configure your tags and the roles associated with them via the data_catalog_tags variable. We suggest using the "Best practices for using policy tags in BigQuery" article as a guide to designing your tag structure and access pattern.
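
As a sketch of that variable, following the {tag => {ROLE => [MEMBERS]}} format described in the variable documentation (check variables.tf for the exact type; the group email is a placeholder):

data_catalog_tags = {
  "3_Confidential" = null
  "2_Private"      = null
  "1_Sensitive" = {
    "roles/datacatalog.categoryFineGrainedReader" = ["group:restricted-data-readers@example.com"]
  }
}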

How to run this script

To deploy this blueprint on your GCP organization, you will need:

  • a folder or organization where new projects will be created
  • a billing account that will be associated with the new projects

The Data Platform is meant to be executed by a Service Account (or a regular user) with this minimal set of permissions (a sketch for granting them follows the list):

  • Billing account
    • roles/billing.user
  • Folder level:
    • roles/resourcemanager.folderAdmin
    • roles/resourcemanager.projectCreator
  • KMS Keys (If CMEK encryption in use):
    • roles/cloudkms.admin or a custom role with cloudkms.cryptoKeys.getIamPolicy, cloudkms.cryptoKeys.list, cloudkms.cryptoKeys.setIamPolicy permissions
  • Shared VPC host project (if configured):
    • roles/compute.xpnAdmin on the host project folder or org
    • roles/resourcemanager.projectIamAdmin on the host project, either with no conditions or with a condition allowing delegated role grants for roles/compute.networkUser, roles/composer.sharedVpcAgent, roles/container.hostServiceAgentUser
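
The snippet below is a sketch, outside the scope of this blueprint, of how the folder and billing roles listed above could be granted to a hypothetical automation service account with Terraform; the folder id, billing account id and service account email are placeholders.

locals {
  # Hypothetical identity that will run this blueprint.
  data_platform_deployer = "serviceAccount:data-platform-deployer@AUTOMATION_PROJECT_ID.iam.gserviceaccount.com"
}

resource "google_billing_account_iam_member" "billing_user" {
  billing_account_id = "123456-123456-123456"
  role               = "roles/billing.user"
  member             = local.data_platform_deployer
}

resource "google_folder_iam_member" "data_platform_deployer" {
  # Grants folder-level roles needed to create the Data Platform projects.
  for_each = toset([
    "roles/resourcemanager.folderAdmin",
    "roles/resourcemanager.projectCreator",
  ])
  folder = "folders/12345678"
  role   = each.key
  member = local.data_platform_deployer
}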

Variable configuration

There are three sets of variables you will need to fill in:

project_config = {
    billing_account_id = "123456-123456-123456"
    parent             = "folders/12345678"
}
organization_domain = "domain.com"
prefix              = "myprefix"

For finer-grained details, check the variables in variables.tf and update them according to the desired configuration.

Remember to create the team groups described above.
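
As a sketch, assuming the default map keys used by the blueprint (check variables.tf for the exact keys) and illustrative group names defined in your Cloud Identity domain:

groups = {
  data-analysts  = "gcp-data-analysts"
  data-engineers = "gcp-data-engineers"
  data-security  = "gcp-data-security"
}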

Once the configuration is complete, run the blueprint with:

terraform init
terraform apply

How to use this blueprint from Terraform

While this blueprint can be used as a standalone deployment, it can also be called directly as a Terraform module by providing the variable values as shown below:

module "data-platform" {
  source              = "./fabric/blueprints/data-solutions/data-platform-minimal/"
  organization_domain = "example.com"
  project_config = {
    billing_account_id = "123456-123456-123456"
    parent             = "folders/12345678"
  }
  prefix = "myprefix"
}

# tftest modules=23 resources=139

Customizations

Assign roles at BQ Dataset level

To handle multiple groups of data analysts accessing the same Data Warehouse layer projects but only the datasets belonging to a specific group, you may want to assign roles at the BigQuery dataset level instead of at the project level. To do this, remove the project-level IAM bindings for the data-analysts group and grant roles at the BigQuery dataset level using the iam variable on the bigquery-dataset modules, as sketched below.
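
A minimal sketch, assuming the Fabric bigquery-dataset module and a hypothetical per-team dataset and group (project id, dataset id and group email are placeholders):

module "team-a-dataset" {
  source     = "./fabric/modules/bigquery-dataset"
  project_id = "CURATED_PROJECT_ID"
  id         = "team_a_curated"
  iam = {
    # Grants read access on this dataset only, instead of project-wide roles.
    "roles/bigquery.dataViewer" = ["group:team-a-analysts@example.com"]
  }
}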

Project Configuration

The solution can be deployed by creating projects on a given parent (organization or folder) or on existing projects. Configure the project_config variable accordingly.

When you deploy the blueprint on existing projects, the blueprint relies on those projects and configures IAM bindings with an additive approach, as sketched below.
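
An illustrative sketch of deploying on existing projects (the exact attributes accepted by project_config are defined in variables.tf; the project ids below are placeholders):

project_config = {
  billing_account_id = null
  project_ids = {
    landing    = "existing-landing-project-id"
    processing = "existing-processing-project-id"
    curated    = "existing-curated-project-id"
    common     = "existing-common-project-id"
  }
}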

Once you have identified the required project granularity for your use case, we suggest adapting the Terraform script accordingly and relying on authoritative IAM bindings.

Shared VPC

To configure the use of a Shared VPC, configure the network_config variable, for example:

network_config = {
  host_project      = "PROJECT_ID"
  network_self_link = "https://www.googleapis.com/compute/v1/projects/PROJECT_ID/global/networks/NAME"
  subnet_self_link  = "https://www.googleapis.com/compute/v1/projects/PROJECT_ID/regions/REGION/subnetworks/NAME"
  composer_ip_ranges = {
    cloudsql   = "192.168.XXX.XXX/24"
    gke_master = "192.168.XXX.XXX/28"
  }
  composer_secondary_ranges = {
    pods     = "pods"
    services = "services"
  }
}

Customer Managed Encryption key

To configure the use of Cloud KMS on resources, configure the service_encryption_keys variable. Key locations should match resource locations. Example:

service_encryption_keys = {
    bq       = "KEY_URL"
    composer = "KEY_URL"
    compute  = "KEY_URL"
    storage  = "KEY_URL"
}

Demo pipeline

The application layer is out of scope of this script. For demo purposes only, a Cloud Composer DAG is provided to document how to deploy a Cloud Dataproc Serverless job on the architecture. You can find examples in the [demo](./demo) folder.

Files

| name | description | modules | resources |
|---|---|---|---|
| 01-landing.tf | Landing project and resources. | gcs · iam-service-account · project | |
| 02-composer.tf | Cloud Composer resources. | iam-service-account | google_composer_environment |
| 02-dataproc.tf | Cloud Dataproc resources. | dataproc · gcs · iam-service-account | |
| 02-processing.tf | Processing project and VPC. | gcs · net-cloudnat · net-vpc · net-vpc-firewall · project | |
| 03-curated.tf | Data curated project and resources. | bigquery-dataset · gcs · project | |
| 04-common.tf | Common project and resources. | data-catalog-policy-tag · project | |
| main.tf | Core locals. | | google_project_iam_member |
| outputs.tf | Output variables. | | |
| variables.tf | Terraform Variables. | | |

Variables

| name | description | type | required | default |
|---|---|---|---|---|
| organization_domain | Organization domain. | string | ✓ | |
| prefix | Prefix used for resource names. | string | ✓ | |
| project_config | Provide 'billing_account_id' value if project creation is needed, uses existing 'project_ids' if null. Parent is in 'folders/nnn' or 'organizations/nnn' format. | object({…}) | ✓ | |
| composer_config | Cloud Composer config. | object({…}) | | {} |
| data_catalog_tags | List of Data Catalog Policy tags to be created with optional IAM binding configuration in {tag => {ROLE => [MEMBERS]}} format. | map(object({…})) | | {…} |
| deletion_protection | Prevent Terraform from destroying data storage resources (storage buckets, GKE clusters, CloudSQL instances) in this blueprint. When this field is set in Terraform state, a terraform destroy or terraform apply that would delete data storage resources will fail. | bool | | false |
| enable_services | Flag to enable or disable services in the Data Platform. | object({…}) | | {} |
| groups | User groups. | map(string) | | {…} |
| location | Location used for multi-regional resources. | string | | "eu" |
| network_config | Shared VPC network configurations to use. If null networks will be created in projects. | object({…}) | | {} |
| project_suffix | Suffix used only for project ids. | string | | null |
| region | Region used for regional resources. | string | | "europe-west1" |
| service_encryption_keys | Cloud KMS to use to encrypt different services. Key location should match service region. | object({…}) | | {} |

Outputs

| name | description | sensitive |
|---|---|---|
| bigquery-datasets | BigQuery datasets. | |
| composer | Composer variables. | |
| dataproc-history-server | List of bucket names which have been assigned to the cluster. | |
| gcs_buckets | GCS buckets. | |
| kms_keys | Cloud KMS keys. | |
| network | VPC network. | |
| projects | GCP Projects information. | |
| service_accounts | Service accounts created. | |