# Manual pipeline example: PubSub to BigQuery

In this example we will publish person messages in the following format:

```txt
name: Name
surname: Surname
timestamp: 1617898199
```

A Dataflow pipeline will read those messages and import them into a BigQuery table in the DWH project.

An authorized view will be created in the Datamart project to expose the table.

[TODO] Further automation is expected in the future.
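
As a reference while that automation is pending, a minimal sketch of how the authorized view could be created with the `bq` CLI and `jq` is shown below. The Datamart project ID, the `bq_datamart_dataset` dataset and the `person_view` name are assumptions and may differ in your setup; `DWH_PROJECT_ID` is set in the next section, and the view should be created once the `person` table exists.

```bash
# Minimal sketch (assumed names: DATAMART_PROJECT_ID, bq_datamart_dataset, person_view).
# Requires the bq CLI and jq; run it after the person table has been created.
export DATAMART_PROJECT_ID=**datamart_project_id**

# 1. Create the view in the Datamart project.
bq mk \
  --use_legacy_sql=false \
  --project_id $DATAMART_PROJECT_ID \
  --view "SELECT name, surname, timestamp FROM \`${DWH_PROJECT_ID}.bq_raw_dataset.person\`" \
  bq_datamart_dataset.person_view

# 2. Authorize the view on the source dataset so it can read the table.
bq show --format=prettyjson $DWH_PROJECT_ID:bq_raw_dataset > dwh_dataset.json
jq --arg p "$DATAMART_PROJECT_ID" \
  '.access += [{"view": {"projectId": $p, "datasetId": "bq_datamart_dataset", "tableId": "person_view"}}]' \
  dwh_dataset.json > dwh_dataset_authorized.json
bq update --source dwh_dataset_authorized.json $DWH_PROJECT_ID:bq_raw_dataset
```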

## Set up the env vars

```bash
export DWH_PROJECT_ID=**dwh_project_id**
export LANDING_PROJECT_ID=**landing_project_id**
export TRANSFORMATION_PROJECT_ID=**transformation_project_id**
```
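
Optionally, you can verify that your user is able to impersonate the three service accounts before moving on. This is a minimal check and assumes you have been granted Token Creator rights on them:

```bash
# Each iteration should print "impersonation OK" for the corresponding service account.
for sa in \
  sa-datawh@$DWH_PROJECT_ID.iam.gserviceaccount.com \
  sa-landing@$LANDING_PROJECT_ID.iam.gserviceaccount.com \
  sa-transformation@$TRANSFORMATION_PROJECT_ID.iam.gserviceaccount.com
do
  gcloud auth print-access-token --impersonate-service-account=$sa > /dev/null \
    && echo "impersonation OK: $sa"
done
```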

## Create BQ table

These steps should be performed as the DWH Service Account.

Run the following command to create the table:

```bash
gcloud --impersonate-service-account=sa-datawh@$DWH_PROJECT_ID.iam.gserviceaccount.com \
  alpha bq tables create person \
  --project=$DWH_PROJECT_ID --dataset=bq_raw_dataset \
  --description "This is a Test Person table" \
  --schema name=STRING,surname=STRING,timestamp=TIMESTAMP
```
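
Optionally, confirm that the table exists and has the expected schema. This check uses the `bq` CLI and assumes your user can read BigQuery metadata in the DWH project:

```bash
# Print the schema of the newly created table.
bq show --schema --format=prettyjson $DWH_PROJECT_ID:bq_raw_dataset.person
```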

## Produce PubSub messages

These steps should be performed as the Landing Service Account.

Let's now publish a series of messages we can then import:

```bash
for i in {0..10}
do
  gcloud --impersonate-service-account=sa-landing@$LANDING_PROJECT_ID.iam.gserviceaccount.com \
    pubsub topics publish projects/$LANDING_PROJECT_ID/topics/landing-1 \
    --message="{\"name\": \"Lorenzo\", \"surname\": \"Caggioni\", \"timestamp\": \"$(date +%s)\"}"
done
```

If you want to check the published messages, you can use the Transformation Service Account and read a message (the message won't be acked and will stay in the subscription):

```bash
gcloud --impersonate-service-account=sa-transformation@$TRANSFORMATION_PROJECT_ID.iam.gserviceaccount.com \
  pubsub subscriptions pull projects/$LANDING_PROJECT_ID/subscriptions/sub1
```
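
To inspect more than one message per call you can add the `--limit` flag; messages are still left unacked:

```bash
gcloud --impersonate-service-account=sa-transformation@$TRANSFORMATION_PROJECT_ID.iam.gserviceaccount.com \
  pubsub subscriptions pull projects/$LANDING_PROJECT_ID/subscriptions/sub1 --limit=5
```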

## Dataflow

These steps should be performed as the Transformation Service Account.

Let's then start a Dataflow streaming pipeline from a Google-provided template, using internal IPs only, the created network and subnetwork, the appropriate service account, and the required parameters:

```bash
gcloud dataflow jobs run test_streaming01 \
    --gcs-location gs://dataflow-templates/latest/PubSub_Subscription_to_BigQuery \
    --project $TRANSFORMATION_PROJECT_ID \
    --region europe-west3 \
    --disable-public-ips \
    --network transformation-vpc \
    --subnetwork regions/europe-west3/subnetworks/transformation-subnet \
    --staging-location gs://$TRANSFORMATION_PROJECT_ID-eu-temp \
    --service-account-email sa-transformation@$TRANSFORMATION_PROJECT_ID.iam.gserviceaccount.com \
    --parameters \
inputSubscription=projects/$LANDING_PROJECT_ID/subscriptions/sub1,\
outputTableSpec=$DWH_PROJECT_ID:bq_raw_dataset.person
```
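
Once the job is submitted, you can optionally verify that it is running and that messages are landing in the destination table. The commands below are a sketch using the region, job and table configured above, and assume read access to the DWH project:

```bash
# The streaming job should appear in the list in a Running state after a few minutes.
gcloud dataflow jobs list \
  --project $TRANSFORMATION_PROJECT_ID --region europe-west3 --status active

# Query the destination table to confirm rows are being written.
bq query --use_legacy_sql=false \
  "SELECT name, surname, timestamp FROM \`${DWH_PROJECT_ID}.bq_raw_dataset.person\` ORDER BY timestamp DESC LIMIT 10"
```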