This repository contains the Terraform modules needed to securely deploy a basic ETL pipeline that loads data from a Google Cloud Storage (GCS) bucket into BigQuery tables.
* Transformation: cleaning the data, converting it into a common format, calculating derived fields, and removing duplicate or erroneous records so that it can be stored in the target.
* Loading: inserting the formatted data into the target database, data store, data warehouse or data lake.
You can learn more about cloud-based ETL [here](https://cloud.google.com/learn/what-is-etl).
## Use cases
Whether you’re transferring from another Cloud Service Provider or you’re taking your first steps into the cloud with Google Cloud, building a data pipeline sets a good foundation to begin deriving insights for your business.
* __Anomaly Detection__: building data pipelines to identify cyber security threats or fraudulent transactions using machine learning (ML) models.
* __Interactive Data Analysis__: carrying out interactive data analysis with BigQuery BI Engine, which lets you analyze large and complex datasets with sub-second query response times and high concurrency.
* __Predictive Forecasting__: building solid pipelines to capture real-time data for ML modeling and using it as a forecasting engine for situations ranging from weather predictions to market forecasting.
* __Create Machine Learning models__: using BigQuery ML to create and execute machine learning models in BigQuery with standard SQL queries. You can train a variety of models built into BigQuery on your own data.
The main components that will be set up are listed below (to learn more about these products, follow the links):
* [Cloud Storage (GCS) bucket](https://cloud.google.com/storage/): data lake solution to store extracted raw data that must undergo some kind of transformation.
* [Cloud Dataflow pipeline](https://cloud.google.com/dataflow): fully managed batch and streaming pipelines, built with Apache Beam, that transform the data stored in GCS buckets so it is ready for processing in the data warehouse.
* [BigQuery datasets and tables](https://cloud.google.com/bigquery): store the transformed data, query it using SQL, build reports from it, or begin training [machine learning](https://cloud.google.com/bigquery-ml/docs/introduction) models without moving your data out of BigQuery.
* [Service accounts](https://cloud.google.com/iam/docs/service-accounts) (__created with least privilege on each resource__): one for uploading data into the GCS bucket, one for orchestration, one for Dataflow instances and one for the BigQuery tables. You can also grant users or groups a viewer role on the created resources, along with the ability to impersonate service accounts, so they can test the Dataflow pipelines before automating them with a tool like [Cloud Composer](https://cloud.google.com/composer).
For a full list of the resources that will be created, please refer to the [GitHub repository](https://github.com/GoogleCloudPlatform/cloud-foundation-fabric/tree/master/blueprints/data-solutions/gcs-to-bq-with-least-privileges) for this project. If you're migrating from another cloud provider, refer to [this documentation](https://cloud.google.com/free/docs/aws-azure-gcp-service-comparison) to see equivalent services and comparisons in Microsoft Azure and Amazon Web Services.
Pricing Estimates - We have created a sample estimate based on usage we see from new startups looking to scale. It gives you an idea of how much this deployment would cost per month at that scale, and you can adjust it to the scale you need. Here's the [link](https://cloud.google.com/products/calculator#id=44710202-c9d4-49d5-a378-99d7dd34f5e2).
This solution assumes you already have a project created and set up where you wish to host these resources. If not, and you would like the Terraform code to create a new project as well, please refer to the [GitHub repository](https://github.com/GoogleCloudPlatform/cloud-foundation-fabric/tree/master/blueprints/data-solutions/gcs-to-bq-with-least-privileges) for instructions.
* Have an [organization](https://cloud.google.com/resource-manager/docs/creating-managing-organization) set up in Google Cloud.
* Have a [billing account](https://cloud.google.com/billing/docs/how-to/manage-billing-account) set up.
* Have an existing [project](https://cloud.google.com/resource-manager/docs/creating-managing-projects) with [billing enabled](https://cloud.google.com/billing/docs/how-to/modify-project). We’ll call this the __service project__.
### Roles & Permissions
To spin up this architecture, you need to be a user with the “__Project owner__” [IAM](https://cloud.google.com/iam) role on the existing project.
__Note__: To grant a user a role, take a look at the [Granting and Revoking Access](https://cloud.google.com/iam/docs/granting-changing-revoking-access#grant-single-role) documentation.
### Spinning up the architecture
#### Step 1: Cloning the repository
Click on the button below, sign in if required and when the prompt appears, click on “confirm”.
Resource creation takes a few minutes. On successful completion, you should see output like the following, along with a list of the created resources:
To demonstrate how the ETL pipeline flow works, we’ve set up an example pipeline for you to run. We assume all the steps are run as a user listed in the __data_eng_principals__ variable (or a user that belongs to one of the groups you specified). Authenticate the user with the following command and make sure your active Cloud Shell session is set to the __service project__:
```shell
gcloud auth application-default login
```
Follow the instructions in Cloud Shell to authenticate the user.
To make the next steps easier, create two environment variables with the service project id and the prefix:
```shell
export SERVICE_PROJECT_ID=[SERVICE_PROJECT_ID]
export PREFIX=[PREFIX]
```
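
As a minimal sketch of this step, the snippet below sets both variables and fails early if either one is empty, which helps catch a forgotten placeholder before later commands use it. The values `my-service-project` and `myprefix` are hypothetical and only there for illustration; substitute your own project ID and prefix.

```shell
# Hypothetical values for illustration only; replace them with your own.
export SERVICE_PROJECT_ID="my-service-project"
export PREFIX="myprefix"

# Abort with a clear message if either variable is unset or empty.
: "${SERVICE_PROJECT_ID:?SERVICE_PROJECT_ID must be set}"
: "${PREFIX:?PREFIX must be set}"

echo "Using project ${SERVICE_PROJECT_ID} with prefix ${PREFIX}"
```

The `${VAR:?message}` expansion is standard POSIX shell, so the check should behave the same way in Cloud Shell as in a local terminal.
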
Again, make sure you’re in the following directory:
* A person.csv file containing your data in the form name,surname,timestamp. For example: `Eva,Rivarola,1637771951`.
* A person_udf.js file containing the [JavaScript UDF](https://cloud.google.com/bigquery/docs/reference/standard-sql/user-defined-functions) used by the Dataflow template.
* A person_schema.json file containing the table schema used to import the CSV.
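
Before uploading your own data, it can help to check that every row matches the `name,surname,timestamp` layout. The sketch below is one possible check, not part of the repository: it writes a small sample file to a scratch directory and counts rows that do not have exactly three fields or whose third field is not an integer. The second sample row is made up, and treating the timestamp as a Unix epoch integer is an assumption based on the example row above.

```shell
# Work in a scratch directory so the real data files are left untouched.
workdir="$(mktemp -d)"

# Sample data: the first row is the example from the text, the second is made up.
cat > "${workdir}/person.csv" <<'EOF'
Eva,Rivarola,1637771951
Ada,Lovelace,1637771952
EOF

# Count rows without exactly three fields or with a non-integer timestamp.
# Assumption: timestamps are Unix epoch seconds, as in the example row.
bad_rows="$(awk -F',' 'NF != 3 || $3 !~ /^[0-9]+$/ { n++ } END { print n + 0 }' "${workdir}/person.csv")"

echo "invalid rows: ${bad_rows}"
```

To check a real file, point `awk` at it instead of the sample; a non-zero count means some rows need cleaning before they are loaded.
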
Once the job completes, you can navigate to BigQuery in the console and under __SERVICE_PROJECT_ID__ → datalake → person, you can see the data that was successfully imported into BigQuery through the Dataflow job.
## Cleaning up your environment
The easiest way to remove all the deployed resources is to run the following command in Cloud Shell:
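
Because the resources are managed by Terraform, this is typically `terraform destroy`, run from the same directory and with the same variables that were used to create them. The sketch below wraps that command in a confirmation guard, since the operation is destructive; `CONFIRM_DESTROY` is a hypothetical variable introduced here for illustration and is not part of the repository.

```shell
# Assumption: Terraform is installed and this runs from the directory
# that holds the Terraform state for the deployment.
# CONFIRM_DESTROY is a hypothetical guard variable, not part of the repository.
destroy_cmd="terraform destroy"

if [ "${CONFIRM_DESTROY:-no}" = "yes" ]; then
  ${destroy_cmd}
else
  echo "Dry run, would execute: ${destroy_cmd}"
fi
```

Running the snippet as is only prints the command; setting `CONFIRM_DESTROY=yes` first lets it run, after which Terraform still asks for its own confirmation.
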
## Variables
| name | description | type | required | default |
|---|---|:---:|:---:|:---:|
| [cmek_encryption](variables.tf#L15) | Flag to enable CMEK on GCP resources created. | <code>bool</code> | | <code>false</code> |
| [data_eng_principals](variables.tf#L21) | Groups with Service Account Token Creator role on service accounts in IAM format, e.g. 'group:group@domain.com'. | <code>list(string)</code> | | <code>[]</code> |
| [network_config](variables.tf#L27) | Shared VPC network configurations to use. If null networks will be created in projects with preconfigured values. | <code title="object({ host_project = string subnet_self_link = string })">object({…})</code> | | <code>null</code> |
| [project_create](variables.tf#L45) | Provide values if project creation is needed, uses existing project if null. Parent is in 'folders/nnn' or 'organizations/nnn' format. | <code title="object({ billing_account_id = string parent = string })">object({…})</code> | | <code>null</code> |
| [region](variables.tf#L59) | The region where resources will be deployed. | <code>string</code> | | <code>"europe-west1"</code> |
| [vpc_subnet_range](variables.tf#L65) | IP range used for the VPC subnet created for the example. | <code>string</code> | | <code>"10.0.0.0/20"</code> |
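
The optional variables in this table can be overridden through a Terraform variables file. The sketch below writes one with the simple-typed variables only, using the defaults and the example group from the table. The file name `terraform.tfvars.example` is an arbitrary choice for illustration, and any required variables that are not listed in this table would still need to be supplied separately.

```shell
# Write an example variables file. Names and values come from the table above;
# the file name is an arbitrary choice for illustration.
cat > terraform.tfvars.example <<'EOF'
# Optional overrides for the deployment.
cmek_encryption     = false
data_eng_principals = ["group:group@domain.com"]
region              = "europe-west1"
vpc_subnet_range    = "10.0.0.0/20"
EOF

echo "Wrote terraform.tfvars.example"
```

Leaving `network_config` and `project_create` out keeps them at their `null` defaults, which according to the table means networks are created with preconfigured values and the existing project is used.
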