Data ingestion Demo

In this folder, you can find an example of how to ingest data into the data platform instantiated here.

The example is not intended to be production-ready code.

Demo use case

The demo imports purchase data generated by a store.

Input files

Data is uploaded to the drop-off GCS bucket. File structure (a sketch of both files follows this list):

  • customers.csv: comma-separated values with customer information in the following format: Customer ID, Name, Surname, Registration Timestamp
  • purchases.csv: comma-separated values with purchase information in the following format: Item ID, Customer ID, Item, Item price, Purchase Timestamp
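
As a reference, the snippet below writes a few rows in the expected column order; the sample values are illustrative only and are not taken from the actual demo data.

    import csv
    from datetime import datetime, timezone

    # Hypothetical sample rows matching the documented column order.
    customers = [
        # Customer ID, Name, Surname, Registration Timestamp
        (1, "Alice", "Rossi", datetime(2023, 1, 10, tzinfo=timezone.utc).isoformat()),
        (2, "Bob", "Bianchi", datetime(2023, 2, 3, tzinfo=timezone.utc).isoformat()),
    ]
    purchases = [
        # Item ID, Customer ID, Item, Item price, Purchase Timestamp
        (100, 1, "umbrella", 12.50, datetime(2023, 3, 1, tzinfo=timezone.utc).isoformat()),
        (101, 2, "socks", 4.99, datetime(2023, 3, 2, tzinfo=timezone.utc).isoformat()),
    ]

    with open("customers.csv", "w", newline="") as f:
        csv.writer(f).writerows(customers)
    with open("purchases.csv", "w", newline="") as f:
        csv.writer(f).writerows(purchases)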

Data processing pipelines

Different data pipelines are provided to highlight different features and patterns. For the purpose of this example, a single pipeline handles the entire data lifecycle. When adapting it to your real use case, you may want to evaluate handling each functional step in a separate pipeline or with a dedicated tool. For example, you may want to use Dataform to handle the lifecycle of your data schemas.

Below you can find a description of each example:

  • Simple data import: datapipeline.py is a simple pipeline that imports the provided data from the drop-off Google Cloud Storage bucket to the Data Hub Confidential layer, joining the customers and purchases tables into a customerpurchase table (a minimal sketch of this pattern follows this list).
  • Import data with Policy Tags: datapipeline_dc_tags.py imports the provided data from the drop-off bucket to the Data Hub Confidential layer, protecting sensitive data with Data Catalog policy tags.
  • Delete tables: delete_table.py deletes the BigQuery tables created by the import pipelines.
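
The following is a minimal sketch of the load-and-join pattern used by the simple import pipeline, not the actual datapipeline.py: project IDs, dataset, bucket, and column names are placeholders you would take from your own deployment.

    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
    from airflow.utils.dates import days_ago

    # Placeholder values: in the real pipeline these come from Composer variables
    # populated by the Terraform outputs of the data platform.
    DROP_BUCKET = "my-dropoff-bucket"
    DWH_PROJECT = "my-dwh-confidential-project"
    DWH_DATASET = "dwh_confidential"

    with DAG("demo_data_import", schedule_interval=None, start_date=days_ago(1)) as dag:
        load_customers = GCSToBigQueryOperator(
            task_id="load_customers",
            bucket=DROP_BUCKET,
            source_objects=["customers.csv"],
            destination_project_dataset_table=f"{DWH_PROJECT}.{DWH_DATASET}.customers",
            schema_fields=[
                {"name": "id", "type": "INTEGER"},
                {"name": "name", "type": "STRING"},
                {"name": "surname", "type": "STRING"},
                {"name": "timestamp", "type": "TIMESTAMP"},
            ],
            write_disposition="WRITE_TRUNCATE",
        )

        load_purchases = GCSToBigQueryOperator(
            task_id="load_purchases",
            bucket=DROP_BUCKET,
            source_objects=["purchases.csv"],
            destination_project_dataset_table=f"{DWH_PROJECT}.{DWH_DATASET}.purchases",
            schema_fields=[
                {"name": "item_id", "type": "INTEGER"},
                {"name": "customer_id", "type": "INTEGER"},
                {"name": "item", "type": "STRING"},
                {"name": "price", "type": "FLOAT"},
                {"name": "timestamp", "type": "TIMESTAMP"},
            ],
            write_disposition="WRITE_TRUNCATE",
        )

        # Join the two staged tables into the customerpurchase table.
        join_tables = BigQueryInsertJobOperator(
            task_id="join_customer_purchase",
            configuration={
                "query": {
                    "query": f"""
                        SELECT c.id, c.name, c.surname, p.item, p.price, p.timestamp
                        FROM `{DWH_PROJECT}.{DWH_DATASET}.customers` c
                        JOIN `{DWH_PROJECT}.{DWH_DATASET}.purchases` p
                        ON c.id = p.customer_id
                    """,
                    "destinationTable": {
                        "projectId": DWH_PROJECT,
                        "datasetId": DWH_DATASET,
                        "tableId": "customerpurchase",
                    },
                    "writeDisposition": "WRITE_TRUNCATE",
                    "useLegacySql": False,
                }
            },
        )

        [load_customers, load_purchases] >> join_tables

The policy-tag variant follows the same structure, typically attaching a policyTags entry to the sensitive schema fields so that column-level access is governed by Data Catalog.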

Running the demo

To run the demo, follow these steps:

  • 01: Copy the sample data to the drop-off Cloud Storage bucket, impersonating the load service account (see the sketch after this list).
  • 02: Copy the sample data structure definition to the orchestration Cloud Storage bucket, impersonating the orchestration service account.
  • 03: Copy the Cloud Composer DAG to the Cloud Composer Storage bucket, impersonating the orchestration service account.
  • 04: Build the Dataflow Flex template and image via a Cloud Build pipeline.
  • 05: Open the Cloud Composer Airflow UI and run the imported DAG.
  • 06: Run the BigQuery query to see the results.
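
The demo_commands output mentioned below provides ready-made commands for these steps; as an alternative illustration of the impersonated copy in steps 01 to 03, the sketch below uploads the sample files to a bucket while impersonating a service account. The bucket, project, and service-account names are placeholders, not the actual values of your deployment.

    from google.auth import default, impersonated_credentials
    from google.cloud import storage

    # Placeholder identities: take the real values from the Terraform outputs
    # of the deployed data platform.
    LOAD_SA = "load-sa@my-load-project.iam.gserviceaccount.com"
    LOAD_PROJECT = "my-load-project"
    DROP_BUCKET = "my-dropoff-bucket"

    # Build short-lived credentials that impersonate the load service account.
    source_credentials, _ = default()
    target_credentials = impersonated_credentials.Credentials(
        source_credentials=source_credentials,
        target_principal=LOAD_SA,
        target_scopes=["https://www.googleapis.com/auth/cloud-platform"],
    )

    # Upload the sample files as the impersonated service account (step 01).
    client = storage.Client(project=LOAD_PROJECT, credentials=target_credentials)
    bucket = client.bucket(DROP_BUCKET)
    for name in ("customers.csv", "purchases.csv"):
        bucket.blob(name).upload_from_filename(name)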

You can find pre-computed commands in the demo_commands output variable of the deployed Terraform data platform.