Data ingestion Demo

In this folder, you can find an example of how to ingest data into the data platform instantiated here.

The example is not intended to be production-ready code.

Demo use case

The demo imports purchase data generated by a store.

Input files

Data is uploaded to the drop-off GCS bucket. File structure (a sketch of both files follows this list):

  • customers.csv: comma-separated values with customer information in the following format: Customer ID, Name, Surname, Registration Timestamp
  • purchases.csv: comma-separated values with purchase information in the following format: Item ID, Customer ID, Item, Item price, Purchase Timestamp
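
As a reference, the snippet below writes a few rows in the expected column order; the sample values are illustrative only and are not taken from the actual demo data.

    import csv
    from datetime import datetime, timezone

    # Hypothetical sample rows matching the documented column order.
    customers = [
        # Customer ID, Name, Surname, Registration Timestamp
        (1, "Alice", "Rossi", datetime(2023, 1, 10, tzinfo=timezone.utc).isoformat()),
        (2, "Bob", "Bianchi", datetime(2023, 2, 3, tzinfo=timezone.utc).isoformat()),
    ]
    purchases = [
        # Item ID, Customer ID, Item, Item price, Purchase Timestamp
        (100, 1, "umbrella", 12.50, datetime(2023, 3, 1, tzinfo=timezone.utc).isoformat()),
        (101, 2, "socks", 4.99, datetime(2023, 3, 2, tzinfo=timezone.utc).isoformat()),
    ]

    with open("customers.csv", "w", newline="") as f:
        csv.writer(f).writerows(customers)
    with open("purchases.csv", "w", newline="") as f:
        csv.writer(f).writerows(purchases)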

Data processing pipelines

Different data pipelines are provided to highlight different features and patterns. For the purpose of this example, a single pipeline handles the entire data lifecycle. When adapting it to your real use case, you may want to evaluate handling each functional step in a separate pipeline or with a dedicated tool. For example, you may want to use Dataform to handle the lifecycle of your data schemas.

Below you can find a description of each example:

  • Simple data import: datapipeline.py is a simple pipeline that imports the provided data from the drop-off Google Cloud Storage bucket to the Data Hub Confidential layer, joining the customers and purchases tables into a customerpurchase table (a minimal sketch of this pattern follows this list).
  • Import data with Policy Tags: datapipeline_dc_tags.py imports the provided data from the drop-off bucket to the Data Hub Confidential layer, protecting sensitive data with Data Catalog policy tags.
  • Delete tables: delete_table.py deletes the BigQuery tables created by the import pipelines.
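
The following is a minimal sketch of the load-and-join pattern used by the simple import pipeline, not the actual datapipeline.py: project IDs, dataset, bucket, and column names are placeholders you would take from your own deployment.

    from airflow import DAG
    from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator
    from airflow.providers.google.cloud.transfers.gcs_to_bigquery import GCSToBigQueryOperator
    from airflow.utils.dates import days_ago

    # Placeholder values: in the real pipeline these come from Composer variables
    # populated by the Terraform outputs of the data platform.
    DROP_BUCKET = "my-dropoff-bucket"
    DWH_PROJECT = "my-dwh-confidential-project"
    DWH_DATASET = "dwh_confidential"

    with DAG("demo_data_import", schedule_interval=None, start_date=days_ago(1)) as dag:
        load_customers = GCSToBigQueryOperator(
            task_id="load_customers",
            bucket=DROP_BUCKET,
            source_objects=["customers.csv"],
            destination_project_dataset_table=f"{DWH_PROJECT}.{DWH_DATASET}.customers",
            schema_fields=[
                {"name": "id", "type": "INTEGER"},
                {"name": "name", "type": "STRING"},
                {"name": "surname", "type": "STRING"},
                {"name": "timestamp", "type": "TIMESTAMP"},
            ],
            write_disposition="WRITE_TRUNCATE",
        )

        load_purchases = GCSToBigQueryOperator(
            task_id="load_purchases",
            bucket=DROP_BUCKET,
            source_objects=["purchases.csv"],
            destination_project_dataset_table=f"{DWH_PROJECT}.{DWH_DATASET}.purchases",
            schema_fields=[
                {"name": "item_id", "type": "INTEGER"},
                {"name": "customer_id", "type": "INTEGER"},
                {"name": "item", "type": "STRING"},
                {"name": "price", "type": "FLOAT"},
                {"name": "timestamp", "type": "TIMESTAMP"},
            ],
            write_disposition="WRITE_TRUNCATE",
        )

        # Join the two staged tables into the customerpurchase table.
        join_tables = BigQueryInsertJobOperator(
            task_id="join_customer_purchase",
            configuration={
                "query": {
                    "query": f"""
                        SELECT c.id, c.name, c.surname, p.item, p.price, p.timestamp
                        FROM `{DWH_PROJECT}.{DWH_DATASET}.customers` c
                        JOIN `{DWH_PROJECT}.{DWH_DATASET}.purchases` p
                        ON c.id = p.customer_id
                    """,
                    "destinationTable": {
                        "projectId": DWH_PROJECT,
                        "datasetId": DWH_DATASET,
                        "tableId": "customerpurchase",
                    },
                    "writeDisposition": "WRITE_TRUNCATE",
                    "useLegacySql": False,
                }
            },
        )

        [load_customers, load_purchases] >> join_tables

The policy-tag variant follows the same structure, typically attaching a policyTags entry to the sensitive schema fields so that column-level access is governed by Data Catalog.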

Running the demo

To run the demo, follow these steps:

  • 01: Copy the sample data to the drop-off Cloud Storage bucket, impersonating the load service account (see the sketch after this list).
  • 02: Copy the sample data structure definition to the orchestration Cloud Storage bucket, impersonating the orchestration service account.
  • 03: Copy the Cloud Composer DAG to the Cloud Composer Storage bucket, impersonating the orchestration service account.
  • 04: Build the Dataflow Flex template and image via a Cloud Build pipeline.
  • 05: Open the Cloud Composer Airflow UI and run the imported DAG.
  • 06: Run the BigQuery query to see the results.
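
The demo_commands output mentioned below provides ready-made commands for these steps; as an alternative illustration of the impersonated copy in steps 01 to 03, the sketch below uploads the sample files to a bucket while impersonating a service account. The bucket, project, and service-account names are placeholders, not the actual values of your deployment.

    from google.auth import default, impersonated_credentials
    from google.cloud import storage

    # Placeholder identities: take the real values from the Terraform outputs
    # of the deployed data platform.
    LOAD_SA = "load-sa@my-load-project.iam.gserviceaccount.com"
    LOAD_PROJECT = "my-load-project"
    DROP_BUCKET = "my-dropoff-bucket"

    # Build short-lived credentials that impersonate the load service account.
    source_credentials, _ = default()
    target_credentials = impersonated_credentials.Credentials(
        source_credentials=source_credentials,
        target_principal=LOAD_SA,
        target_scopes=["https://www.googleapis.com/auth/cloud-platform"],
    )

    # Upload the sample files as the impersonated service account (step 01).
    client = storage.Client(project=LOAD_PROJECT, credentials=target_credentials)
    bucket = client.bucket(DROP_BUCKET)
    for name in ("customers.csv", "purchases.csv"):
        bucket.blob(name).upload_from_filename(name)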

You can find pre-computed commands in the demo_commands output variable of the deployed Terraform data platform.