cloud-foundation-fabric/data-solutions/data-platform-foundations/README.md

62 lines
2.9 KiB
Markdown
Raw Normal View History

# Data Foundation Platform
2021-05-18 09:30:21 -07:00
The goal of this example is to Build a robust and flexible Data Foundation on GCP, providing opinionated defaults while still allowing customers to quickly and reliably build and scale out additional data pipelines.
2021-05-18 09:30:21 -07:00
The example is composed of three separate provisioning workflows, which are deisgned to be plugged together and create end to end Data Foundations, that support multiple data pipelines on top.
2021-05-18 09:30:21 -07:00
2021-06-15 06:12:14 -07:00
1. **[Environment Setup](./01-environment/)**
*(once per environment)*
2021-06-15 06:12:14 -07:00
* projects
* VPC configuration
* Composer environment and identity
* shared buckets and datasets
1. **[Data Source Setup](./02-resources)**
*(once per data source)*
2021-06-15 06:12:14 -07:00
* landing and archive bucket
* internal and external identities
* domain specific datasets
1. **[Pipeline Setup](./03-pipeline)**
*(once per pipeline)*
2021-06-15 06:12:14 -07:00
* pipeline-specific tables and views
* pipeline code
* Composer DAG
2021-05-18 09:30:21 -07:00
The resulting GCP architecture is outlined in this diagram
2021-06-15 06:12:14 -07:00
![Target architecture](./02-resources/diagram.png)
2021-05-18 09:30:21 -07:00
A demo pipeline is also part of this example: it can be built and run on top of the foundational infrastructure to quickly verify or test the setup.
2021-05-18 09:30:21 -07:00
## Prerequisites
2021-05-18 09:30:21 -07:00
In order to bring up this example, you will need
2021-05-18 09:30:21 -07:00
- a folder or organization where new projects will be created
- a billing account that will be associated to new projects
- an identity (user or service account) with owner permissions on the folder or org, and billing user permissions on the billing account
2021-05-18 09:30:21 -07:00
## Bringing up the platform
2021-05-18 09:30:21 -07:00
2021-06-15 11:55:25 -07:00
[![Open in Cloud Shell](https://gstatic.com/cloudssh/images/open-btn.svg)](https://ssh.cloud.google.com/cloudshell/editor?cloudshell_git_repo=https%3A%2F%2Fgithub.com%2Fterraform-google-modules%2Fcloud-foundation-fabric.git&cloudshell_open_in_editor=README.md&cloudshell_workspace=data-solutions%2Fdata-platform-foundations)
2021-06-15 06:12:14 -07:00
The end-to-end example is composed of 2 foundational, and 1 optional steps:
2021-05-18 09:30:21 -07:00
2021-06-15 06:12:14 -07:00
1. [Environment setup](./01-environment/)
1. [Data source setup](./02-resources/)
1. (Optional) [Pipeline setup](./03-pipeline/)
2021-05-18 09:30:21 -07:00
The environment setup is designed to manage a single environment. Various strategies like workspaces, branching, or even separate clones can be used to support multiple environments.
2021-05-18 09:30:21 -07:00
## TODO
2021-05-18 09:30:21 -07:00
| Description | Priority (1:High - 5:Low ) | Status | Remarks |
|-------------|----------|:------:|---------|
| DLP best practices in the pipeline | 2 | Not Started | |
| Add Composer with a static DAG running the example | 3 | Not Started | |
| Integrate [CI/CD composer data processing workflow framework](https://github.com/jaketf/ci-cd-for-data-processing-workflow) | 3 | Not Started | |
| Schema changes, how to handle | 4 | Not Started | |
| Data lineage | 4 | Not Started | |
| Data quality checks | 4 | Not Started | |
| Shared-VPC | 5 | Not Started | |
| Logging & monitoring | TBD | Not Started | |
| Orcestration for ingestion pipeline (just in the readme) | TBD | Not Started | |