# Data Ingestion Demo

In this folder you can find an example of ingesting data into the `data platform` instantiated [here](../).

The example is not intended to be production-ready code.

## Demo use case

The demo imports purchase data generated by a store.

## Input files

Data are uploaded to the `drop off` GCS bucket. File structure:

- `customers.csv`: comma-separated values file with customer information in the following format: Customer ID, Name, Surname, Registration Timestamp
- `purchases.csv`: comma-separated values file with purchase information in the following format: Item ID, Customer ID, Item, Item Price, Purchase Timestamp

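The two input files can be generated locally with a short script. This is a hedged sketch: the rows below are hypothetical sample data invented for illustration, not data shipped with the demo.

```python
import csv
from pathlib import Path

# Hypothetical sample rows matching the schemas described above.
# customers.csv: Customer ID, Name, Surname, Registration Timestamp
customers = [
    ["1", "Mary", "Major", "1637771951"],
    ["2", "John", "Doe", "1637771952"],
]
# purchases.csv: Item ID, Customer ID, Item, Item Price, Purchase Timestamp
purchases = [
    ["10", "1", "t-shirt", "19.99", "1637771970"],
    ["11", "2", "mug", "8.50", "1637771975"],
]

out_dir = Path("sample_data")
out_dir.mkdir(exist_ok=True)

with open(out_dir / "customers.csv", "w", newline="") as f:
    csv.writer(f).writerows(customers)

with open(out_dir / "purchases.csv", "w", newline="") as f:
    csv.writer(f).writerows(purchases)
```

The files are written without a header row; adjust if your pipeline expects one.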
## Data processing pipelines

Different data pipelines are provided to highlight different features and patterns. For the purpose of the example, a single pipeline handles the whole data lifecycle. When adapting the pipelines to your real use case, you may want to evaluate handling each functional step in a separate pipeline or with a dedicated tool. For example, you may want to use `Dataform` to manage the data schema lifecycle.

Below you can find a description of each example:

- Simple data import: [`datapipeline.py`](./datapipeline.py) is a simple pipeline that imports the provided data from the `drop off` Google Cloud Storage bucket to the Data Hub Confidential layer, joining the `customers` and `purchases` tables into the `customerpurchase` table.
- Import data with Policy Tags: [`datapipeline_dc_tags.py`](./datapipeline_dc_tags.py) imports the provided data from the `drop off` bucket to the Data Hub Confidential layer, protecting sensitive data using Data Catalog Policy Tags.
- Delete tables: [`delete_table.py`](./delete_table.py) deletes the BigQuery tables created by the import pipelines.

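To make the shape of the resulting `customerpurchase` table easy to see, here is a hedged sketch of the customers/purchases join reproduced locally on in-memory CSV data. The real pipeline performs this join in BigQuery; the sample rows and output column names here are assumptions for illustration only.

```python
import csv
import io

# Hypothetical sample data in the input file formats described above.
customers_csv = "1,Mary,Major,1637771951\n2,John,Doe,1637771952\n"
purchases_csv = "10,1,t-shirt,19.99,1637771970\n11,2,mug,8.50,1637771975\n"

# Index customers by Customer ID.
customers = {}
for cid, name, surname, reg_ts in csv.reader(io.StringIO(customers_csv)):
    customers[cid] = {"name": name, "surname": surname, "registration_ts": reg_ts}

# Inner join on Customer ID, producing one row per purchase.
customerpurchase = []
for item_id, cid, item, price, purchase_ts in csv.reader(io.StringIO(purchases_csv)):
    row = {
        "customer_id": cid,
        "item_id": item_id,
        "item": item,
        "price": float(price),
        "purchase_ts": purchase_ts,
    }
    row.update(customers[cid])
    customerpurchase.append(row)
```

An inner join is used here, so purchases without a matching customer would be dropped; check which join semantics your use case needs.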
## Running the demo

To run the demo examples, follow these steps:

- 01: Copy the sample data to the `drop off` Cloud Storage bucket impersonating the `load` service account.
- 02: Copy the sample data structure definition to the `orchestration` Cloud Storage bucket impersonating the `orchestration` service account.
- 03: Copy the Cloud Composer DAG to the Cloud Composer storage bucket impersonating the `orchestration` service account.
- 04: Open the Cloud Composer Airflow UI and run the imported DAG.
- 05: Run a BigQuery query to see the results.

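As a hedged sketch of what the copy commands in steps 01-03 might look like, the helper below builds `gcloud storage cp` invocations with service-account impersonation. The bucket and service-account names are hypothetical placeholders; the actual values come from your deployment.

```python
def copy_cmd(src: str, dest: str, service_account: str) -> str:
    """Build a gcloud storage copy command impersonating a service account."""
    return (
        f"gcloud storage cp {src} {dest} "
        f"--impersonate-service-account={service_account}"
    )

# Step 01: sample data to the drop off bucket as the load service account
# (hypothetical names).
print(copy_cmd(
    "sample_data/*.csv",
    "gs://example-drop-off",
    "load-sa@example-project.iam.gserviceaccount.com",
))
```

`--impersonate-service-account` is a global `gcloud` flag, so the same pattern applies to the DAG and data-structure copies in steps 02 and 03.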
You can find pre-computed commands in the `demo_commands` output variable of the deployed Terraform [data pipeline](../).