cloud-foundation-fabric/blueprints/data-solutions/bq-ml/demo/bmql_pipeline.ipynb

465 lines
16 KiB
Plaintext

{
"cells": [
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"**Copyright 2023 Google LLC**"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Copyright 2023 Google LLC\n",
"#\n",
"# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
"# you may not use this file except in compliance with the License.\n",
"# You may obtain a copy of the License at\n",
"#\n",
"# https://www.apache.org/licenses/LICENSE-2.0\n",
"#\n",
"# Unless required by applicable law or agreed to in writing, software\n",
"# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
"# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
"# See the License for the specific language governing permissions and\n",
"# limitations under the License."
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Install python requirements and import packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%pip install -r requirements.txt"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import kfp\n",
"import google_cloud_pipeline_components.v1.bigquery as bqop\n",
"\n",
"from google.cloud import aiplatform as aip\n",
"from google.cloud import bigquery"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Set your env variables"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Set your variables\n",
"PREFIX = 'your-prefix'\n",
"PROJECT_ID = 'your-project-id'"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"DATASET = \"{}_data\".format(PREFIX.replace(\"-\",\"_\")) \n",
"EXPERIMENT_NAME = 'bqml-experiment'\n",
"ENDPOINT_DISPLAY_NAME = 'bqml-endpoint'\n",
"LOCATION = 'US'\n",
"MODEL_NAME = 'bqml-model'\n",
"PIPELINE_NAME = 'bqml-vertex-pipeline'\n",
"PIPELINE_ROOT = f\"gs://{PREFIX}-data\"\n",
"REGION = 'us-central1'\n",
"SERVICE_ACCOUNT = f\"vertex-sa@{PROJECT_ID}.iam.gserviceaccount.com\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Vertex AI Pipeline Definition\n",
"\n",
"Let's first define the queries for the features and target creation and the query to train the model\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# this query creates the features for our model and the target value we would like to predict\n",
"\n",
"features_query = \"\"\"\n",
"CREATE VIEW if NOT EXISTS `{project_id}.{dataset}.ecommerce_abt` AS\n",
"WITH abt AS (\n",
" SELECT user_id,\n",
" session_id,\n",
" city,\n",
" postal_code,\n",
" browser,\n",
" traffic_source,\n",
" min(created_at) AS session_starting_ts,\n",
" sum(CASE WHEN event_type = 'purchase' THEN 1 ELSE 0 END) has_purchased\n",
" FROM `bigquery-public-data.thelook_ecommerce.events` \n",
" GROUP BY user_id,\n",
" session_id,\n",
" city,\n",
" postal_code,\n",
" browser,\n",
" traffic_source\n",
"), previous_orders AS (\n",
" SELECT user_id,\n",
" array_agg (struct(created_at AS order_creations_ts,\n",
" o.order_id,\n",
" o.status,\n",
" oi.order_cost)) as user_orders\n",
" FROM `bigquery-public-data.thelook_ecommerce.orders` o\n",
" JOIN (SELECT order_id,\n",
" sum(sale_price) order_cost \n",
" FROM `bigquery-public-data.thelook_ecommerce.order_items`\n",
" GROUP BY 1) oi\n",
" ON o.order_id = oi.order_id\n",
" GROUP BY 1\n",
")\n",
"SELECT abt.*,\n",
" CASE WHEN extract(DAYOFWEEK FROM session_starting_ts) IN (1,7)\n",
" THEN 'WEEKEND' \n",
" ELSE 'WEEKDAY'\n",
" END AS day_of_week,\n",
" extract(HOUR FROM session_starting_ts) hour_of_day,\n",
" (SELECT count(DISTINCT uo.order_id) \n",
" FROM unnest(user_orders) uo \n",
" WHERE uo.order_creations_ts < session_starting_ts \n",
" AND status IN ('Shipped', 'Complete', 'Processing')) AS number_of_successful_orders,\n",
" IFNULL((SELECT sum(DISTINCT uo.order_cost) \n",
" FROM unnest(user_orders) uo \n",
" WHERE uo.order_creations_ts < session_starting_ts \n",
" AND status IN ('Shipped', 'Complete', 'Processing')), 0) AS sum_previous_orders,\n",
" (SELECT count(DISTINCT uo.order_id) \n",
" FROM unnest(user_orders) uo \n",
" WHERE uo.order_creations_ts < session_starting_ts \n",
" AND status IN ('Cancelled', 'Returned')) AS number_of_unsuccessful_orders\n",
"FROM abt \n",
"LEFT JOIN previous_orders pso \n",
"ON abt.user_id = pso.user_id\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# this query create the train job on BQ ML\n",
"train_query = \"\"\"\n",
"CREATE OR REPLACE MODEL `{project_id}.{dataset}.{model_name}`\n",
"OPTIONS(MODEL_TYPE='{model_type}',\n",
" INPUT_LABEL_COLS=['has_purchased'],\n",
" ENABLE_GLOBAL_EXPLAIN=TRUE,\n",
" MODEL_REGISTRY='VERTEX_AI',\n",
" DATA_SPLIT_METHOD = 'RANDOM',\n",
" DATA_SPLIT_EVAL_FRACTION = {split_fraction}\n",
" ) AS \n",
"SELECT * EXCEPT (session_id, session_starting_ts, user_id) \n",
"FROM `{project_id}.{dataset}.ecommerce_abt`\n",
"WHERE extract(ISOYEAR FROM session_starting_ts) = 2022\n",
"\"\"\""
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"In the following code block, we are defining our Vertex AI pipeline. It is made up of three main steps:\n",
"1. Create a BigQuery dataset that will contain the BigQuery ML models\n",
"2. Train the BigQuery ML model, in this case, a logistic regression\n",
"3. Evaluate the BigQuery ML model with the standard evaluation metrics\n",
"\n",
"The pipeline takes as input the following variables:\n",
"- ```dataset```: name of the dataset where the artifacts will be stored\n",
"- ```evaluate_job_conf```: bq dict configuration to define where to store evaluation metrics\n",
"- ```location```: BigQuery location\n",
"- ```model_name```: the display name of the BigQuery ML model\n",
"- ```project_id```: the project id where the GCP resources will be created\n",
"- ```split_fraction```: the percentage of data that will be used as an evaluation dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"@kfp.dsl.pipeline(name='bqml-pipeline', pipeline_root=PIPELINE_ROOT)\n",
"def pipeline(\n",
" model_name: str,\n",
" split_fraction: float,\n",
" evaluate_job_conf: dict, \n",
" dataset: str = DATASET,\n",
" project_id: str = PROJECT_ID,\n",
" location: str = LOCATION,\n",
" ):\n",
"\n",
" create_dataset = bqop.BigqueryQueryJobOp(\n",
" project=project_id,\n",
" location=location,\n",
" query=f'CREATE SCHEMA IF NOT EXISTS {dataset}'\n",
" )\n",
"\n",
" create_features_view = bqop.BigqueryQueryJobOp(\n",
" project=project_id,\n",
" location=location,\n",
" query=features_query.format(dataset=dataset, project_id=project_id),\n",
"\n",
" ).after(create_dataset)\n",
"\n",
" create_bqml_model = bqop.BigqueryCreateModelJobOp(\n",
" project=project_id,\n",
" location=location,\n",
" query=train_query.format(model_type = 'LOGISTIC_REG'\n",
" , project_id = project_id\n",
" , dataset = dataset\n",
" , model_name = model_name\n",
" , split_fraction=split_fraction)\n",
" ).after(create_features_view)\n",
"\n",
" evaluate_bqml_model = bqop.BigqueryEvaluateModelJobOp(\n",
" project=project_id,\n",
" location=location,\n",
" model=create_bqml_model.outputs[\"model\"],\n",
" job_configuration_query=evaluate_job_conf\n",
" ).after(create_bqml_model)\n",
"\n",
"\n",
"# this is to compile our pipeline and generate the json description file\n",
"kfp.v2.compiler.Compiler().compile(pipeline_func=pipeline,\n",
" package_path=f'{PIPELINE_NAME}.json') "
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Create Experiment\n",
"\n",
"We will create an experiment to keep track of our training and tasks on a specific issue or problem."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"my_experiment = aip.Experiment.get_or_create(\n",
" experiment_name=EXPERIMENT_NAME,\n",
" description='This is a new experiment to keep track of bqml trainings',\n",
" project=PROJECT_ID,\n",
" location=REGION\n",
")"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Running the same training Vertex AI pipeline with different parameters\n",
"\n",
"One of the main tasks during the training phase is to compare different models or to try the same model with different inputs. We can leverage the power of Vertex AI Pipelines to submit the same steps with different training parameters. Thanks to the experiments artifact, it is possible to easily keep track of all the tests that have been done. This simplifies the process of selecting the best model to deploy.\n",
"\n",
"In this demo case, we will run the same training pipeline while changing the split data percentage between training and test data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# this configuration is needed in order to persist the evaluation metrics on big query\n",
"job_configuration_query = {\"destinationTable\": {\"projectId\": PROJECT_ID, \"datasetId\": DATASET}, \"writeDisposition\": \"WRITE_TRUNCATE\"}\n",
"\n",
"for split_fraction in [0.1, 0.2]:\n",
" job_configuration_query['destinationTable']['tableId'] = MODEL_NAME+'-fraction-{}-eval_table'.format(int(split_fraction*100))\n",
" pipeline = aip.PipelineJob(\n",
" parameter_values = {'split_fraction':split_fraction, 'model_name': MODEL_NAME+'-fraction-{}'.format(int(split_fraction*100)), 'evaluate_job_conf': job_configuration_query },\n",
" display_name=PIPELINE_NAME,\n",
" template_path=f'{PIPELINE_NAME}.json',\n",
" pipeline_root=PIPELINE_ROOT,\n",
" enable_caching=True\n",
" )\n",
"\n",
" pipeline.submit(service_account=SERVICE_ACCOUNT, experiment=my_experiment)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Deploy the model on a Vertex AI endpoint\n",
"\n",
"Thanks to the integration of Vertex AI Endpoint, creating a live endpoint to serve the model we prefer is very straightforward."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# get the model from the Model Registry \n",
"model = aip.Model(model_name=f'{MODEL_NAME}-fraction-10')\n",
"\n",
"# let's create a Vertex Endpoint where we will deploy the ML model\n",
"endpoint = aip.Endpoint.create(\n",
" display_name=ENDPOINT_DISPLAY_NAME,\n",
" project=PROJECT_ID,\n",
" location=REGION,\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# deploy the BigQuery ML model on Vertex Endpoint\n",
"# have a coffee - this step can take up 10/15 minutes to finish\n",
"model.deploy(endpoint=endpoint, deployed_model_display_name='bqml-deployed-model')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Let's get a prediction from new data\n",
"inference_test = {\n",
" 'postal_code': '97700-000',\n",
" 'number_of_successful_orders': 0,\n",
" 'city': 'Santiago',\n",
" 'sum_previous_orders': 1,\n",
" 'number_of_unsuccessful_orders': 0,\n",
" 'day_of_week': 'WEEKDAY',\n",
" 'traffic_source': 'Facebook',\n",
" 'browser': 'Firefox',\n",
" 'hour_of_day': 20\n",
"}"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"my_prediction = endpoint.predict([inference_test])\n",
"\n",
"my_prediction"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# batch prediction on BigQuery\n",
"\n",
"explain_predict_query = \"\"\"\n",
"SELECT *\n",
"FROM ML.EXPLAIN_PREDICT(MODEL `{project_id}.{dataset}.{model_name}`,\n",
" (SELECT * EXCEPT (session_id, session_starting_ts, user_id, has_purchased) \n",
" FROM `{project_id}.{dataset}.ecommerce_abt`\n",
" WHERE extract(ISOYEAR FROM session_starting_ts) = 2023),\n",
" STRUCT(5 AS top_k_features, 0.5 AS threshold))\n",
"LIMIT 100\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# batch prediction on BigQuery\n",
"\n",
"client = bigquery_client = bigquery.Client(location=LOCATION, project=PROJECT_ID)\n",
"batch_predictions = bigquery_client.query(\n",
" explain_predict_query.format(\n",
" project_id=PROJECT_ID,\n",
" dataset=DATASET,\n",
" model_name=f'{MODEL_NAME}-fraction-10')\n",
" ).to_dataframe()\n",
"\n",
"batch_predictions"
]
},
{
"attachments": {},
"cell_type": "markdown",
"metadata": {},
"source": [
"# Conclusions\n",
"\n",
"Thanks to this tutorial we were able to:\n",
"- Define a re-usable Vertex AI pipeline to train and evaluate BQ ML models\n",
"- Use a Vertex AI Experiment to keep track of multiple trainings for the same model with different parameters (in this case a different split for train/test data)\n",
"- Deploy the preferred model on a Vertex AI managed Endpoint in order to serve the model for real-time use cases via API\n",
"- Make batch prediction via Big Query and see what are the top 5 features which influenced the algorithm output"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"name": "python",
"version": "3.8.9"
},
"orig_nbformat": 4,
"vscode": {
"interpreter": {
"hash": "31f2aee4e71d21fbe5cf8b01ff0e069b9275f58929596ceb00d14d90e3e16cd6"
}
}
},
"nbformat": 4,
"nbformat_minor": 2
}