Here is the usecase we are trying to achieve.
Suppose there are a bunch of columns with TAG_ prefix and all are STRING type. (See attached image to get an idea)
We want to populate a derived nested column say "tags_json" which is of JSON data type in the same table.
conditions:
The content of this derived tags_json column is based on contents of corresponding TAG_ prefixed columns.
Each element in a json should have key as corresponding TAG_ column name but without the TAG_ prefix and value as corresponding column's row value.
This JSON column should contain only keys whose values are non null or empty string in main TAG_ column.
Static query is fine.
Use native JSON data type in bigquery.
Can use information schema to first determine available TAG_ columns.
Example: In a particular row, If we have:
TAG_aws_eks_cluster_name = "bottlerocket"
TAG_aws_eks_deployment = "default-backend"
TAG_user_ttl = NULL
Output: in the derived tags_json column should have:
{"aws_eks_cluster_name":"bottlerocket","aws_eks_deployment":"default-backend"}
We want to do it efficiently and something which can work potentially on a tables ranging from 100K rows to ~2 Billion rows.
The number of TAG_ columns can change.
The solution can be combination of SQL + Python or pure SQL.
Tried some iterations of this in various LLMs but nothing concrete came out.
Looking for expert opinion on this and possible solutions which can work at scale.
Here is the usecase we are trying to achieve.
Suppose there are a bunch of columns with TAG_ prefix and all are STRING type. (See attached image to get an idea)
We want to populate a derived nested column say "tags_json" which is of JSON data type in the same table.
conditions:
The content of this derived tags_json column is based on contents of corresponding TAG_ prefixed columns.
Each element in a json should have key as corresponding TAG_ column name but without the TAG_ prefix and value as corresponding column's row value.
This JSON column should contain only keys whose values are non null or empty string in main TAG_ column.
Static query is fine.
Use native JSON data type in bigquery.
Can use information schema to first determine available TAG_ columns.
Example: In a particular row, If we have:
TAG_aws_eks_cluster_name = "bottlerocket"
TAG_aws_eks_deployment = "default-backend"
TAG_user_ttl = NULL
Output: in the derived tags_json column should have:
{"aws_eks_cluster_name":"bottlerocket","aws_eks_deployment":"default-backend"}
We want to do it efficiently and something which can work potentially on a tables ranging from 100K rows to ~2 Billion rows.
The number of TAG_ columns can change.
The solution can be combination of SQL + Python or pure SQL.
Tried some iterations of this in various LLMs but nothing concrete came out.
Looking for expert opinion on this and possible solutions which can work at scale.
Share Improve this question edited Mar 30 at 11:37 Mark Rotteveel 110k229 gold badges156 silver badges223 bronze badges asked Mar 30 at 5:57 badjanbadjan 1151 gold badge1 silver badge8 bronze badges 1- Hello Community, Any insights into this ? – badjan Commented Apr 1 at 18:59
1 Answer
Reset to default 0Problem: You need to create a derived tags_json
column in BigQuery based on dynamic TAG_
columns. The tags_json
should only include keys for non-null, non-empty TAG_
values, where the keys are the TAG_
columns' names (without the TAG_
prefix). You want an efficient solution, ideally working at scale, for tables with anywhere from 100K to 2B rows.
Solution Overview:
Identify
TAG_
Columns Dynamically: We can useINFORMATION_SCHEMA.COLUMNS
to get the list of allTAG_
columns in the dataset.Construct the JSON Object: We will dynamically create a JSON object where the key is the column name (without
TAG_
prefix), and the value is the value of the column (if it is non-null and non-empty).Efficient Querying: The solution needs to work efficiently at scale, so we’ll use BigQuery’s powerful JSON functions and ensure that only valid, non-null, and non-empty columns are included.
Here’s the SQL query to populate the derived tags_json
column:
WITH tag_columns AS (
SELECT
column_name
FROM
`project_id.dataset_id.INFORMATION_SCHEMA.COLUMNS`
WHERE
table_name = 'your_table_name'
AND column_name LIKE 'TAG_%'
),
json_columns AS (
SELECT
*,
ARRAY(
SELECT AS STRUCT
REPLACE(column_name, 'TAG_', '') AS key,
CAST(value AS STRING) AS value
FROM
UNNEST([STRUCT('TAG_aws_eks_cluster_name' AS column_name, 'bottlerocket' AS value),
STRUCT('TAG_aws_eks_deployment' AS column_name, 'default-backend' AS value)])
) AS tags_json
FROM
`project_id.dataset_id.your_table_name`
)
SELECT
*,
JSON_OBJECT_AGG(key, value) AS tags_json
FROM
json_columns
WHERE
value IS NOT NULL AND value != ''
GROUP BY
id; -- Assuming there's a unique identifier column for your table