Data Foundation Platform

The goal of this example is to build a robust and flexible Data Foundation on GCP that provides opinionated defaults, while still allowing customers to quickly and reliably build and scale out additional data pipelines.

The example is composed of three separate provisioning workflows, designed to be plugged together to create an end-to-end Data Foundation that supports multiple data pipelines on top (see the sketch after this list):

  1. Environment Setup (once per environment)
    • projects
    • VPC configuration
    • Composer environment and identity
    • shared buckets and datasets
  2. Data Source Setup (once per data source)
    • landing and archive bucket
    • internal and external identities
    • domain specific datasets
  3. Pipeline Setup (once per pipeline)
    • pipeline-specific tables and views
    • pipeline code
    • Composer DAG
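
As a rough sketch of how these stages plug together, the commands below run the first two stages and feed an output of the environment stage into the data source stage. The output name and variable name (`project_id`) are illustrative assumptions, not the modules' actual interfaces.

```bash
# Hedged sketch: chaining the provisioning stages. The stage
# directories match this example; the output and variable names
# used below are assumptions.
cd 01-environment
terraform init
terraform apply

# Feed an identifier created by the environment stage into the next
# stage (the actual output/variable names may differ).
project_id=$(terraform output -raw project_id)

cd ../02-resources
terraform init
terraform apply -var "project_id=${project_id}"
```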

The resulting GCP architecture is outlined in the following diagram.

[Target architecture diagram]

A demo pipeline is also part of this example: it can be built and run on top of the foundational infrastructure to quickly verify or test the setup.

Prerequisites

To bring up this example, you will need:

  • a folder or organization where new projects will be created
  • a billing account that will be associated to new projects
  • an identity (user or service account) with owner permissions on the folder or organization, and billing user permissions on the billing account (the corresponding grants are sketched below)
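
As a minimal sketch, assuming a user identity and the gcloud CLI, the grants above could be assigned as follows; FOLDER_ID, BILLING_ACCOUNT_ID, and the member are placeholders.

```bash
# Illustrative IAM grants for the prerequisites; replace the
# placeholders with your folder ID, billing account ID, and identity.
gcloud resource-manager folders add-iam-policy-binding FOLDER_ID \
  --member="user:you@example.com" \
  --role="roles/owner"

gcloud beta billing accounts add-iam-policy-binding BILLING_ACCOUNT_ID \
  --member="user:you@example.com" \
  --role="roles/billing.user"
```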

Bringing up the platform

The end-to-end example is composed of two foundational steps and one optional step (a sample run is sketched after this list):

  1. Environment setup
  2. Data source setup
  3. (Optional) Pipeline setup
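
A sample end-to-end run might look like the following. This is a sketch, assuming each stage directory already carries the variable values it needs (for example via tfvars files), not a verbatim transcript of this example's setup.

```bash
# Illustrative end-to-end run: apply each stage in order. The optional
# pipeline stage can be repeated for every additional pipeline.
for stage in 01-environment 02-resources 03-pipeline; do
  (cd "$stage" && terraform init && terraform apply)
done
```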

The environment setup is designed to manage a single environment. Strategies such as Terraform workspaces, branching, or separate clones can be used to support multiple environments; a workspace-based approach is sketched below.
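
For example, a workspace-based approach might look like this; the per-environment tfvars file names are assumptions.

```bash
# Illustrative multi-environment setup via Terraform workspaces;
# per-environment settings live in separate tfvars files (names assumed).
cd 01-environment

terraform workspace new dev || terraform workspace select dev
terraform apply -var-file=dev.tfvars

terraform workspace new prod || terraform workspace select prod
terraform apply -var-file=prod.tfvars
```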

TODO

Description | Priority (1: High - 5: Low) | Status
DLP best practices in the pipeline | 2 | Not Started
Add Composer with a static DAG running the example | 3 | Not Started
Integrate CI/CD Composer data processing workflow framework | 3 | Not Started
How to handle schema changes | 4 | Not Started
Data lineage | 4 | Not Started
Data quality checks | 4 | Not Started
Shared VPC | 5 | Not Started
Logging & monitoring | TBD | Not Started
Orchestration for the ingestion pipeline (just in the README) | TBD | Not Started