最新消息:雨落星辰是一个专注网站SEO优化、网站SEO诊断、搜索引擎研究、网络营销推广、网站策划运营及站长类的自媒体原创博客

sql - Populate data in a derived column of JSON data type - Stack Overflow

programmeradmin2浏览0评论

Here is the usecase we are trying to achieve.

Suppose there are a bunch of columns with TAG_ prefix and all are STRING type. (See attached image to get an idea)

We want to populate a derived nested column say "tags_json" which is of JSON data type in the same table.

conditions:

  1. The content of this derived tags_json column is based on contents of corresponding TAG_ prefixed columns.

  2. Each element in a json should have key as corresponding TAG_ column name but without the TAG_ prefix and value as corresponding column's row value.

  3. This JSON column should contain only keys whose values are non null or empty string in main TAG_ column.

  4. Static query is fine.

  5. Use native JSON data type in bigquery.

  6. Can use information schema to first determine available TAG_ columns.

Example: In a particular row, If we have:

TAG_aws_eks_cluster_name = "bottlerocket"

TAG_aws_eks_deployment = "default-backend"

TAG_user_ttl = NULL

Output: in the derived tags_json column should have:

{"aws_eks_cluster_name":"bottlerocket","aws_eks_deployment":"default-backend"}

We want to do it efficiently and something which can work potentially on a tables ranging from 100K rows to ~2 Billion rows.

The number of TAG_ columns can change.

The solution can be combination of SQL + Python or pure SQL.

Tried some iterations of this in various LLMs but nothing concrete came out.

Looking for expert opinion on this and possible solutions which can work at scale.

Here is the usecase we are trying to achieve.

Suppose there are a bunch of columns with TAG_ prefix and all are STRING type. (See attached image to get an idea)

We want to populate a derived nested column say "tags_json" which is of JSON data type in the same table.

conditions:

  1. The content of this derived tags_json column is based on contents of corresponding TAG_ prefixed columns.

  2. Each element in a json should have key as corresponding TAG_ column name but without the TAG_ prefix and value as corresponding column's row value.

  3. This JSON column should contain only keys whose values are non null or empty string in main TAG_ column.

  4. Static query is fine.

  5. Use native JSON data type in bigquery.

  6. Can use information schema to first determine available TAG_ columns.

Example: In a particular row, If we have:

TAG_aws_eks_cluster_name = "bottlerocket"

TAG_aws_eks_deployment = "default-backend"

TAG_user_ttl = NULL

Output: in the derived tags_json column should have:

{"aws_eks_cluster_name":"bottlerocket","aws_eks_deployment":"default-backend"}

We want to do it efficiently and something which can work potentially on a tables ranging from 100K rows to ~2 Billion rows.

The number of TAG_ columns can change.

The solution can be combination of SQL + Python or pure SQL.

Tried some iterations of this in various LLMs but nothing concrete came out.

Looking for expert opinion on this and possible solutions which can work at scale.

Share Improve this question edited Mar 30 at 11:37 Mark Rotteveel 110k229 gold badges156 silver badges223 bronze badges asked Mar 30 at 5:57 badjanbadjan 1151 gold badge1 silver badge8 bronze badges 1
  • Hello Community, Any insights into this ? – badjan Commented Apr 1 at 18:59
Add a comment  | 

1 Answer 1

Reset to default 0

Problem: You need to create a derived tags_json column in BigQuery based on dynamic TAG_ columns. The tags_json should only include keys for non-null, non-empty TAG_ values, where the keys are the TAG_ columns' names (without the TAG_ prefix). You want an efficient solution, ideally working at scale, for tables with anywhere from 100K to 2B rows.

Solution Overview:

  1. Identify TAG_ Columns Dynamically: We can use INFORMATION_SCHEMA.COLUMNS to get the list of all TAG_ columns in the dataset.

  2. Construct the JSON Object: We will dynamically create a JSON object where the key is the column name (without TAG_ prefix), and the value is the value of the column (if it is non-null and non-empty).

  3. Efficient Querying: The solution needs to work efficiently at scale, so we’ll use BigQuery’s powerful JSON functions and ensure that only valid, non-null, and non-empty columns are included.

Here’s the SQL query to populate the derived tags_json column:

WITH tag_columns AS (
  SELECT
    column_name
  FROM
    `project_id.dataset_id.INFORMATION_SCHEMA.COLUMNS`
  WHERE
    table_name = 'your_table_name'
    AND column_name LIKE 'TAG_%'
),

json_columns AS (
  SELECT
    *,
    ARRAY(
      SELECT AS STRUCT
        REPLACE(column_name, 'TAG_', '') AS key,
        CAST(value AS STRING) AS value
      FROM
        UNNEST([STRUCT('TAG_aws_eks_cluster_name' AS column_name, 'bottlerocket' AS value), 
                STRUCT('TAG_aws_eks_deployment' AS column_name, 'default-backend' AS value)])
    ) AS tags_json
  FROM
    `project_id.dataset_id.your_table_name`
)

SELECT
  *,
  JSON_OBJECT_AGG(key, value) AS tags_json
FROM
  json_columns
WHERE
  value IS NOT NULL AND value != ''
GROUP BY
  id;  -- Assuming there's a unique identifier column for your table
发布评论

评论列表(0)

  1. 暂无评论