update Data Platform blueprint README with more example Dataflow commands

Ayman Farhat 2023-02-16 15:45:20 +01:00
parent e64e8db20d
commit a853dc4fe2
1 changed file with 62 additions and 3 deletions


@@ -3,7 +3,66 @@ This demo serves as a simple example of building and launching a Flex Template D
![Dataflow pipeline overview](../../images/df_demo_pipeline.png "Dataflow pipeline overview")
## Local development run
For local development, the pipeline can be launched from the local machine for testing purposes using different runners depending on the scope of the test.
### Using the Beam DirectRunner
The example below uses the Beam DirectRunner. This runner is mainly intended for quick local tests on low volumes of data.
```
CSV_FILE=gs://[TEST-BUCKET]/customers.csv
JSON_SCHEMA=gs://[TEST-BUCKET]/customers_schema.json
OUTPUT_TABLE=[TEST-PROJ].[TEST-DATASET].customers
PIPELINE_STAGING_PATH="gs://[TEST-STAGING-BUCKET]"
python src/csv2bq.py \
--runner="DirectRunner" \
--csv_file=$CSV_FILE \
--json_schema=$JSON_SCHEMA \
--output_table=$OUTPUT_TABLE \
--temp_location=$PIPELINE_STAGING_PATH/tmp
```
*Note:* All paths mentioned can be local paths or on GCS. For cloud resources referenced (GCS and BigQuery), make sure that the user launching the command is authenticated to GCP via `gcloud auth application-default login` and has the required access privileges to those resources.
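For reference, a minimal sketch of setting up those local credentials (the quota project value is a placeholder):
```
# Create application-default credentials used by the Beam/GCP client libraries
gcloud auth application-default login

# Optionally set the quota project used when calling GCP APIs
gcloud auth application-default set-quota-project [TEST-PROJ]
```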
### Using the DataflowRunner with a local CLI launch
The example below uses the DataflowRunner, launched locally from the CLI. This is useful for running tests on larger volumes of test data and for verifying that the pipeline runs well on Dataflow before compiling it into a template.
```
PROJECT_ID=[TEST-PROJECT]
REGION=[REGION]
SUBNET=[SUBNET-NAME]
DEV_SERVICE_ACCOUNT=[DEV-SA]
PIPELINE_STAGING_PATH="gs://[TEST-STAGING-BUCKET]"
CSV_FILE=gs://[TEST-BUCKET]/customers.csv
JSON_SCHEMA=gs://[TEST-BUCKET]/customers_schema.json
OUTPUT_TABLE=[TEST-PROJ].[TEST-DATASET].customers
python src/csv2bq.py \
--runner="Dataflow" \
--project=$PROJECT_ID \
--region=$REGION \
--csv_file=$CSV_FILE \
--json_schema=$JSON_SCHEMA \
--output_table=$OUTPUT_TABLE \
--temp_location=$PIPELINE_STAGING_PATH/tmp \
--staging_location=$PIPELINE_STAGING_PATH/stage \
--subnetwork="regions/$REGION/subnetworks/$SUBNET" \
--impersonate_service_account=$DEV_SERVICE_ACCOUNT \
--no_use_public_ips
```
In terms of resource access privileges, you can choose to impersonate another service account, which could be defined for development resource access. The authenticated user launching this pipeline needs the `roles/iam.serviceAccountTokenCreator` role on that service account. If you launch the pipeline without service account impersonation, it will use the default compute service account of the target project.
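As a sketch, granting a user permission to impersonate the development service account could look like the following (`[DEV-SA]` and `[USER-EMAIL]` are placeholders):
```
# Allow the launching user to mint tokens for the development service account
gcloud iam service-accounts add-iam-policy-binding [DEV-SA] \
  --member="user:[USER-EMAIL]" \
  --role="roles/iam.serviceAccountTokenCreator"
```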
## Dataflow Flex Template run
For production, and as outlined in the Data Platform demo, we build and launch the pipeline as a Flex Template, making it available for other cloud services (such as Apache Airflow) and users to launch instances of it on demand.
### Build launch
Below is an example of triggering the Dataflow Flex Template build pipeline defined in `cloudbuild.yaml`. The Terraform output also provides an example, pre-filled with parameter values based on the resources generated in the data platform.
@@ -28,9 +87,9 @@ gcloud builds submit \
**Note:** For the scope of the demo, the launch of this build is manual, but in production this build would be launched via a configured Cloud Build trigger when new changes are merged into the code branch of the Dataflow template.
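A possible sketch of such a trigger, assuming a GitHub-hosted repository and placeholder repository and branch values (not part of the demo setup itself):
```
# Build the Flex Template whenever changes land on the template's code branch
gcloud builds triggers create github \
  --name="dataflow-csv2bq-template-build" \
  --repo-owner="[REPO-OWNER]" \
  --repo-name="[REPO-NAME]" \
  --branch-pattern="^main$" \
  --build-config="cloudbuild.yaml"
```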
### Dataflow Flex Template run
After the build step succeeds, you can launch the Dataflow pipeline from the CLI (as outlined in this example) or via the API through Airflow's operator. For the use case of the data platform, the Dataflow pipeline is launched via the orchestration service account, which is what the Airflow DAG also uses in the scope of this demo.
**Note:** In the data platform demo, the launch of this Dataflow pipeline is handled by the Airflow operator (`DataflowStartFlexTemplateOperator`).
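For reference, a rough sketch of a manual CLI launch from the built template; the template GCS path, job name, and bucket values below are placeholders, and the exact parameters depend on the build output:
```
gcloud dataflow flex-template run "csv2bq-manual-test" \
  --project=[TEST-PROJ] \
  --region=[REGION] \
  --template-file-gcs-location="gs://[TEMPLATES-BUCKET]/csv2bq.json" \
  --parameters csv_file="gs://[TEST-BUCKET]/customers.csv" \
  --parameters json_schema="gs://[TEST-BUCKET]/customers_schema.json" \
  --parameters output_table="[TEST-PROJ].[TEST-DATASET].customers" \
  --parameters temp_location="gs://[TEST-STAGING-BUCKET]/tmp"
```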