Lorenzo Caggioni 2020-06-30 10:56:27 +02:00
parent 9a4ec24093
commit 58e5dfa620
5 changed files with 56 additions and 14 deletions

View File

@ -69,7 +69,7 @@ This sample creates several distinct groups of resources:
| vm | GCE VMs. | |
<!-- END TFDOC -->
## Test your environment
## Test your environment with Cloud Dataflow
You can now connect to the GCE instance with the following command:
```bash
@ -106,14 +106,33 @@ python data_ingestion.py \
--region=europe-west1 \
--staging_location=gs://lc-001-eu-df-tmplocation/ \
--temp_location=gs://lc-001-eu-df-tmplocation/ \
--project=lcaggio-demo-001 \
--input=gs://lc-001-eu-data/person.csv \
--project=lcaggio-demo \
--input=gs://lc-eu-data/person.csv \
--output=bq_dataset.df_import \
--service_account_email=df-test@lcaggio-aa-demo-001.iam.gserviceaccount.com \
--service_account_email=df-test@lcaggio-demo.iam.gserviceaccount.com \
--network=local \
--subnetwork=regions/europe-west1/subnetworks/subnet \
--dataflow_kms_key=projects/lcaggio-demo-kms/locations/europe-west1/keyRings/my-keyring-regional/cryptoKeys/key-df \
--no_use_public_ips
```
You can check the imported data in Google BigQuery from the Google Cloud Console UI.
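If you prefer the command line, a quick sanity check might look like the following (a minimal sketch, assuming the `bq_dataset.df_import` table created by the command above):
```bash
# Count the rows loaded by the Dataflow pipeline into the table used in the example above.
bq query --use_legacy_sql=false 'SELECT COUNT(*) AS row_count FROM bq_dataset.df_import'
```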
## Test your environment with the `bq` CLI
You can now connect to the GCE instance with the following command:
```bash
gcloud compute ssh vm-example-1
```
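The `bq` CLI is part of the Cloud SDK, which should already be available on the Google-provided Debian image used by the VM; a quick check (not part of the original steps) is:
```bash
# Verify the bq CLI is installed on the instance.
bq version
```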
You can now run a simple `bq load` command to import data into BigQuery. Below is an example command:
```bash
bq load \
--source_format=CSV \
bq_dataset.bq_import \
gs://my-bucket/person.csv \
schema_bq_import.json
```
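The command above references a `schema_bq_import.json` schema file. A minimal sketch of such a file, using hypothetical field names for `person.csv`, could be created like this:
```bash
# Hypothetical schema for person.csv; adjust field names and types to match your data.
cat > schema_bq_import.json <<'EOF'
[
  {"name": "name", "type": "STRING", "mode": "NULLABLE"},
  {"name": "surname", "type": "STRING", "mode": "NULLABLE"},
  {"name": "age", "type": "INTEGER", "mode": "NULLABLE"}
]
EOF
```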
You can check the imported data in Google BigQuery from the Google Cloud Console UI.
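Alternatively, you can inspect the resulting table from the CLI (a minimal sketch, assuming the `bq_dataset.bq_import` table from the command above):
```bash
# Show the schema and row count of the table created by the bq load command above.
bq show --format=prettyjson bq_dataset.bq_import
```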

View File

@ -15,7 +15,8 @@
locals {
vm-startup-script = join("\n", [
"#! /bin/bash",
"apt-get update && apt-get install -y bash-completion git python3-venv gcc build-essential python-dev"
"apt-get update && apt-get install -y bash-completion git python3-venv gcc build-essential python-dev python3-dev",
"pip3 install --upgrade setuptools pip"
])
}
@ -230,7 +231,7 @@ module "vm_example" {
}
}
]
instance_count = 1
instance_count = 2
boot_disk = {
image = "projects/debian-cloud/global/images/family/debian-10"
type = "pd-ssd"

View File

@ -0,0 +1,4 @@
# Scripts
In this section you can find two simple scripts to test your environment:
- [Data ingestion](./data_ingestion/): a simple Apache Beam Python pipeline to import data from Google Cloud Storage into BigQuery.
- [Person details generator](./person_details_generator/): a simple script to generate some random data to test your environment.

View File

@ -28,7 +28,7 @@ Create a new virtual environment (recommended) and install requirements:
```
virtualenv env
source ./env/bin/activate
pip install -r requirements.txt
pip3 install -r requirements.txt
```
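Before moving on, you may want to confirm that the Apache Beam SDK installed correctly; a quick sanity check (not part of the original steps) is:
```
# Print the installed Apache Beam version to confirm the environment is ready.
python3 -c "import apache_beam; print(apache_beam.__version__)"
```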
## 4. Upload files into Google Cloud Storage
@ -63,7 +63,7 @@ python data_ingestion.py \
Or you can run the pipeline on Google Dataflow using the following command:
```
python pipelines/data_ingestion_configurable.py \
python data_ingestion.py \
--runner=DataflowRunner \
--max_num_workers=100 \
--autoscaling_algorithm=THROUGHPUT_BASED \
@ -71,10 +71,27 @@ python pipelines/data_ingestion_configurable.py \
--staging_location=###PUT HERE GCS STAGING LOCATION### \
--temp_location=###PUT HERE GCS TMP LOCATION### \
--project=###PUT HERE PROJECT ID### \
--input-bucket=###PUT HERE GCS BUCKET NAME### \
--input-path=###PUT HERE INPUT FOLDER### \
--input-files=###PUT HERE FILE NAMES### \
--bq-dataset=###PUT HERE BQ DATASET NAME###
--input=###PUT HERE GCS INPUT FILE. EXAMPLE: gs://bucket_name/person.csv### \
--output=###PUT HERE BQ DATASET AND TABLE. EXAMPLE: bq_dataset.df_import###
```
Below is an example that runs the pipeline specifying a network and subnetwork, using private IPs, and using a KMS key to encrypt data at rest:
```
python data_ingestion.py \
--runner=DataflowRunner \
--max_num_workers=100 \
--autoscaling_algorithm=THROUGHPUT_BASED \
--region=###PUT HERE REGION### \
--staging_location=###PUT HERE GCS STAGING LOCATION### \
--temp_location=###PUT HERE GCS TMP LOCATION### \
--project=###PUT HERE PROJECT ID### \
--network=###PUT HERE YOUR NETWORK### \
--subnetwork=###PUT HERE YOUR SUBNETWORK. EXAMPLE: regions/europe-west1/subnetworks/subnet### \
--dataflow_kms_key=###PUT HERE KMS KEY. EXAMPLE: projects/lcaggio-d-4-kms/locations/europe-west1/keyRings/my-keyring-regional/cryptoKeys/key-df### \
--input=###PUT HERE GCS INPUT FILE. EXAMPLE: gs://bucket_name/person.csv### \
--output=###PUT HERE BQ DATASET AND TABLE. EXAMPLE: bq_dataset.df_import### \
--no_use_public_ips
```
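For the KMS option to work, the Dataflow and Compute Engine service agents typically need the `roles/cloudkms.cryptoKeyEncrypterDecrypter` role on the key; a quick way to review the key's IAM bindings (a sketch, using the same placeholder style as above) is:
```
# Review the IAM bindings on the KMS key used by the pipeline (replace the placeholders).
gcloud kms keys get-iam-policy ###PUT HERE KEY NAME### \
  --keyring=###PUT HERE KEYRING NAME### \
  --location=###PUT HERE REGION### \
  --project=###PUT HERE KMS PROJECT ID###
```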
## 6. Check results

View File

@ -1,2 +1,3 @@
wheel
apache-beam
apache-beam[gcp]
setuptools
wheel