Lorenzo Caggioni 2020-06-30 10:56:27 +02:00
parent 9a4ec24093
commit 58e5dfa620
5 changed files with 56 additions and 14 deletions

View File

@ -69,7 +69,7 @@ This sample creates several distinct groups of resources:
| vm | GCE VMs. | |
<!-- END TFDOC -->
## Test your environment
## Test your environment with Cloud Dataflow
You can now connect to the GCE instance with the following command:
```bash
@ -106,14 +106,33 @@ python data_ingestion.py \
--region=europe-west1 \
--staging_location=gs://lc-001-eu-df-tmplocation/ \
--temp_location=gs://lc-001-eu-df-tmplocation/ \
--project=lcaggio-demo-001 \
--input=gs://lc-001-eu-data/person.csv \
--project=lcaggio-demo \
--input=gs://lc-eu-data/person.csv \
--output=bq_dataset.df_import \
--service_account_email=df-test@lcaggio-aa-demo-001.iam.gserviceaccount.com \
--service_account_email=df-test@lcaggio-demo.iam.gserviceaccount.com \
--network=local \
--subnetwork=regions/europe-west1/subnetworks/subnet \
--dataflow_kms_key=projects/lcaggio-demo-kms/locations/europe-west1/keyRings/my-keyring-regional/cryptoKeys/key-df \
--no_use_public_ips
```
You can check the imported data in Google BigQuery from the Google Cloud Console UI.
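If you prefer the command line, a quick sanity check might look like the following (a minimal sketch, assuming the `bq_dataset.df_import` table created by the command above):
```bash
# Count the rows loaded by the Dataflow pipeline into the table used in the example above.
bq query --use_legacy_sql=false 'SELECT COUNT(*) AS row_count FROM bq_dataset.df_import'
```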
## Test your environment with the `bq` CLI
You can now connect to the GCE instance with the following command:
```bash
gcloud compute ssh vm-example-1
```
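The `bq` CLI is part of the Cloud SDK, which should already be available on the Google-provided Debian image used by the VM; a quick check (not part of the original steps) is:
```bash
# Verify the bq CLI is installed on the instance.
bq version
```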
You can now run a simple `bq load` command to import data into BigQuery. Below is an example command:
```bash
bq load \
--source_format=CSV \
bq_dataset.bq_import \
gs://my-bucket/person.csv \
schema_bq_import.json
```
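The command above references a `schema_bq_import.json` schema file. A minimal sketch of such a file, using hypothetical field names for `person.csv`, could be created like this:
```bash
# Hypothetical schema for person.csv; adjust field names and types to match your data.
cat > schema_bq_import.json <<'EOF'
[
  {"name": "name", "type": "STRING", "mode": "NULLABLE"},
  {"name": "surname", "type": "STRING", "mode": "NULLABLE"},
  {"name": "age", "type": "INTEGER", "mode": "NULLABLE"}
]
EOF
```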
You can check the imported data in Google BigQuery from the Google Cloud Console UI.
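Alternatively, you can inspect the resulting table from the CLI (a minimal sketch, assuming the `bq_dataset.bq_import` table from the command above):
```bash
# Show the schema and row count of the table created by the bq load command above.
bq show --format=prettyjson bq_dataset.bq_import
```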

View File

@ -15,7 +15,8 @@
locals {
vm-startup-script = join("\n", [
"#! /bin/bash",
"apt-get update && apt-get install -y bash-completion git python3-venv gcc build-essential python-dev"
"apt-get update && apt-get install -y bash-completion git python3-venv gcc build-essential python-dev python3-dev",
"pip3 install --upgrade setuptools pip"
])
}
@ -230,7 +231,7 @@ module "vm_example" {
}
}
]
instance_count = 1
instance_count = 2
boot_disk = {
image = "projects/debian-cloud/global/images/family/debian-10"
type = "pd-ssd"

View File

@ -0,0 +1,4 @@
# Scripts
In this section you can find two simple scripts to test your environment:
- [Data ingestion](./data_ingestion/): a simple Apache Beam Python pipeline to import data from Google Cloud Storage into BigQuery.
- [Person details generator](./person_details_generator/): a simple script to generate some random data to test your environment.

View File

@ -28,7 +28,7 @@ Create a new virtual environment (recommended) and install requirements:
```
virtualenv env
source ./env/bin/activate
pip install -r requirements.txt
pip3 install -r requirements.txt
```
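Before moving on, you may want to confirm that the Apache Beam SDK installed correctly; a quick sanity check (not part of the original steps) is:
```
# Print the installed Apache Beam version to confirm the environment is ready.
python3 -c "import apache_beam; print(apache_beam.__version__)"
```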
## 4. Upload files into Google Cloud Storage
@ -63,7 +63,7 @@ python data_ingestion.py \
Or you can run the pipeline on Google Dataflow using the following command:
```
python pipelines/data_ingestion_configurable.py \
python data_ingestion.py \
--runner=DataflowRunner \
--max_num_workers=100 \
--autoscaling_algorithm=THROUGHPUT_BASED \
@ -71,10 +71,27 @@ python pipelines/data_ingestion_configurable.py \
--staging_location=###PUT HERE GCS STAGING LOCATION### \
--temp_location=###PUT HERE GCS TMP LOCATION### \
--project=###PUT HERE PROJECT ID### \
--input-bucket=###PUT HERE GCS BUCKET NAME### \
--input-path=###PUT HERE INPUT FOLDER### \
--input-files=###PUT HERE FILE NAMES### \
--bq-dataset=###PUT HERE BQ DATASET NAME###
--input=###PUT HERE GCS INPUT FILE. EXAMPLE: gs://bucket_name/person.csv### \
--output=###PUT HERE BQ DATASET AND TABLE. EXAMPLE: bq_dataset.df_import###
```
Below is an example that runs the pipeline specifying a network and subnetwork, using private IPs, and using a KMS key to encrypt data at rest:
```
python data_ingestion.py \
--runner=DataflowRunner \
--max_num_workers=100 \
--autoscaling_algorithm=THROUGHPUT_BASED \
--region=###PUT HERE REGION### \
--staging_location=###PUT HERE GCS STAGING LOCATION### \
--temp_location=###PUT HERE GCS TMP LOCATION### \
--project=###PUT HERE PROJECT ID### \
--network=###PUT HERE YOUR NETWORK### \
--subnetwork=###PUT HERE YOUR SUBNETWORK. EXAMPLE: regions/europe-west1/subnetworks/subnet### \
--dataflow_kms_key=###PUT HERE KMS KEY. EXAMPLE: projects/lcaggio-d-4-kms/locations/europe-west1/keyRings/my-keyring-regional/cryptoKeys/key-df### \
--input=###PUT HERE GCS INPUT FILE. EXAMPLE: gs://bucket_name/person.csv### \
--output=###PUT HERE BQ DATASET AND TABLE. EXAMPLE: bq_dataset.df_import### \
--no_use_public_ips
```
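For the KMS option to work, the Dataflow and Compute Engine service agents typically need the `roles/cloudkms.cryptoKeyEncrypterDecrypter` role on the key; a quick way to review the key's IAM bindings (a sketch, using the same placeholder style as above) is:
```
# Review the IAM bindings on the KMS key used by the pipeline (replace the placeholders).
gcloud kms keys get-iam-policy ###PUT HERE KEY NAME### \
  --keyring=###PUT HERE KEYRING NAME### \
  --location=###PUT HERE REGION### \
  --project=###PUT HERE KMS PROJECT ID###
```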
## 6. Check results

View File

@ -1,2 +1,3 @@
wheel
apache-beam
apache-beam[gcp]
setuptools
wheel