Manual pipeline example: GCS to BigQuery

In this example we will create person records in the following format:

Lorenzo,Caggioni,1617898199

A Dataflow pipeline will read those records from a CSV file on GCS and import them into a BigQuery table in the DWH project.

[TODO] An authorized view will be created in the datamart project to expose the table.
[TODO] Remove the hardcoded 'lcaggio' values and make them ENV variables.
[TODO] Further automation is expected in future.

Create and download keys for the service accounts you created.
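
If you have not created the keys yet, a minimal sketch with gcloud follows; it assumes your user account has permission to create keys for these service accounts, and it uses the key file names referenced in the commands below:

gcloud iam service-accounts keys create sa-dwh.json \
    --iam-account=sa-dwh@dwh-lc01.iam.gserviceaccount.com
gcloud iam service-accounts keys create sa-landing.json \
    --iam-account=sa-landing@landing-lc01.iam.gserviceaccount.com
gcloud iam service-accounts keys create sa-transformation.json \
    --iam-account=sa-transformation@transformation-lc01.iam.gserviceaccount.com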

Create BQ table

These steps should be performed as the DWH service account:

gcloud auth activate-service-account sa-dwh@dwh-lc01.iam.gserviceaccount.com --key-file=sa-dwh.json --project=dwh-lc01

and then run the following command to create the table:

bq mk \
-t \
--description "This is a Test Person table" \
dwh-lc01:bq_raw_dataset.person \
name:STRING,surname:STRING,timestamp:TIMESTAMP
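
Optionally, still authenticated as the DWH service account, you can verify that the table was created with the expected schema:

bq show --schema --format=prettyjson dwh-lc01:bq_raw_dataset.person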

Produce CSV data file, JSON schema file and UDF JS file

These steps should be performed as the Landing service account:

gcloud auth activate-service-account sa-landing@landing-lc01.iam.gserviceaccount.com --key-file=sa-landing.json --project=landing-lc01

Let's now create a set of records we can use for the import:

# Generate 11 sample person records with the current epoch timestamp
for i in {0..10}
do
  echo "Lorenzo,Caggioni,$(date +%s)" >> person.csv
done
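
As a quick optional sanity check, the loop writes 11 records (one per iteration); you can inspect the file before uploading it:

wc -l person.csv
head -n 3 person.csv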

Then copy the file to the GCS bucket:

gsutil cp person.csv gs://landing-lc01-eu-raw-data

Let's now create the JSON schema for the data:

cat <<'EOF' > person_schema.json
{
  "BigQuery Schema": [
    {
      "name": "name",
      "type": "STRING"
    },
    {
      "name": "surname",
      "type": "STRING"
    },
    {
      "name": "timestamp",
      "type": "TIMESTAMP"
    }
  ]
}
EOF
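
Optionally, you can verify that the file is valid JSON before uploading it; this check assumes python3 is available locally:

python3 -m json.tool person_schema.json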

Then copy the file to the GCS bucket:

gsutil cp person_schema.json gs://landing-lc01-eu-data-schema

Let's now create the UDF JavaScript function used to transform each record:

cat <<'EOF' > person_udf.js
// Transform a CSV line (name,surname,timestamp) into a JSON string
// matching the BigQuery table schema.
function transform(line) {
  var values = line.split(',');

  var obj = new Object();
  obj.name = values[0];
  obj.surname = values[1];
  obj.timestamp = values[2];
  var jsonString = JSON.stringify(obj);

  return jsonString;
}
EOF
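
If you want to test the UDF locally before uploading it, the sketch below assumes Node.js is installed; it loads the function and transforms a sample record:

node -e "$(cat person_udf.js); console.log(transform('Lorenzo,Caggioni,1617898199'));"

This should print {"name":"Lorenzo","surname":"Caggioni","timestamp":"1617898199"}.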

Then copy the file to the GCS bucket:

gsutil cp person_udf.js gs://landing-lc01-eu-data-schema

If you want to check the files copied to GCS, you can use the Transformation service account:

gcloud auth activate-service-account sa-transformation@transformation-lc01.iam.gserviceaccount.com --key-file=sa-transformation.json --project=transformation-lc01

and list the files in the buckets:

gsutil ls gs://landing-lc01-eu-raw-data
gsutil ls gs://landing-lc01-eu-data-schema
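
You can also inspect the contents of an uploaded object directly, for example:

gsutil cat gs://landing-lc01-eu-raw-data/person.csv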

Dataflow

These steps should be performed as the Transformation service account:

gcloud auth activate-service-account sa-transformation@transformation-lc01.iam.gserviceaccount.com --key-file=sa-transformation.json --project=transformation-lc01

Let's then start a Dataflow batch pipeline from a Google-provided template, using internal IPs only, the network and subnetwork we created, the appropriate service account and the required parameters:

gcloud dataflow jobs run test_batch_lcaggio01 \
    --gcs-location gs://dataflow-templates/latest/GCS_Text_to_BigQuery \
    --project transformation-lc01 \
    --region europe-west3 \
    --disable-public-ips \
    --network transformation-vpc \
    --subnetwork regions/europe-west3/subnetworks/transformation-subnet \
    --staging-location gs://transformation-lc01-eu-temp \
    --service-account-email sa-transformation@transformation-lc01.iam.gserviceaccount.com \
    --parameters \
javascriptTextTransformFunctionName=transform,\
JSONPath=gs://landing-lc01-eu-data-schema/person_schema.json,\
javascriptTextTransformGcsPath=gs://landing-lc01-eu-data-schema/person_udf.js,\
inputFilePattern=gs://landing-lc01-eu-raw-data/person.csv,\
outputTable=dwh-lc01:bq_raw_dataset.person,\
bigQueryLoadingTemporaryDirectory=gs://transformation-lc01-eu-temp
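
Once the job has been submitted you can monitor it from the Transformation project:

gcloud dataflow jobs list --project transformation-lc01 --region europe-west3

and, after it completes, verify that the records landed in BigQuery. The query below is a sketch: it assumes you run it as an account with read access to the DWH dataset, for example the DWH service account used to create the table:

bq --project_id=dwh-lc01 query --use_legacy_sql=false \
    'SELECT COUNT(*) AS row_count FROM `dwh-lc01.bq_raw_dataset.person`'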