
pyarrow - Is my understanding of the RECORD REPEATED column issue in BigQuery correct? - Stack Overflow


I am working with the following Pydantic models, which are transformed into PyArrow tables before being written to Parquet (in a GCS bucket). Pydantic schema:

from pydantic import BaseModel, Field
from typing import List

class StatEntry(BaseModel):
    value: int
    element: int

class Stat(BaseModel):
    identifier: str
    a: List[StatEntry] = Field(default_factory=list)
    h: List[StatEntry] = Field(default_factory=list)

PyArrow Schema Used to Generate the Parquet File:

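import pyarrow as pa

class Record:  # hypothetical name; the original excerpt omits the enclosing class
    @classmethod
    def pyarrow_schema(cls):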
        return pa.schema(
            [
                pa.field("code", pa.int32()),
                pa.field("event", pa.int32()),
                pa.field(
                    "stats",
                    pa.list_(
                        pa.struct(
                            [
                                pa.field("identifier", pa.string()),
                                pa.field(
                                    "a",
                                    pa.list_(
                                        pa.struct(
                                            [
                                                pa.field("value", pa.int32()),
                                                pa.field("element", pa.int32()),
                                            ]
                                        )
                                    ),
                                ),
                                pa.field(
                                    "h",
                                    pa.list_(
                                        pa.struct(
                                            [
                                                pa.field("value", pa.int32()),
                                                pa.field("element", pa.int32()),
                                            ]
                                        )
                                    ),
                                ),
                            ]
                        )
                    ),
                ),
                pa.field("ingestion_time", pa.date32()),
            ]
        )

The ultimate goal is to load these Parquet files into a BigQuery table with this schema:

from google.cloud import bigquery

schema = [
    bigquery.SchemaField("code", "INTEGER"),
    bigquery.SchemaField("event", "INTEGER"),
    bigquery.SchemaField(
        "stats",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("identifier", "STRING"),
            bigquery.SchemaField(
                "a",
                "RECORD",
                mode="REPEATED",
                fields=[
                    bigquery.SchemaField("value", "INTEGER"),
                    bigquery.SchemaField("element", "INTEGER"),
                ],
            ),
            bigquery.SchemaField(
                "h",
                "RECORD",
                mode="REPEATED",
                fields=[
                    bigquery.SchemaField("value", "INTEGER"),
                    bigquery.SchemaField("element", "INTEGER"),
                ],
            ),
        ],
    ),
    bigquery.SchemaField("ingestion_time", "DATE"),
]

The field h is defined as pa.list_(pa.struct([...])), so it should always be an array of structs. Sometimes h (or the entire stats field) is an empty list, and I suspect BigQuery might misinterpret that as a scalar rather than a repeated field.
In BigQuery, h is declared as REPEATED, meaning it must always be an array and cannot be NULLABLE. Is my understanding correct? If so, are there any workarounds?
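One defensive workaround I could try, sketched below, is to normalize every row before building the PyArrow table so that stats, a, and h are always real lists and never None. The helper name normalize_row and the sample row are hypothetical and not part of my pipeline. Whether BigQuery actually requires this is the open question; the sketch only guarantees the shape of the input.

```python
from typing import Any, Dict, List

# Hypothetical helper, not part of the original pipeline: it forces every
# list-typed field to be a real list so no row carries None for a repeated field.
REPEATED_STAT_FIELDS = ("a", "h")


def normalize_row(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `row` where `stats`, `a`, and `h` are never None."""
    stats: List[Dict[str, Any]] = []
    for stat in row.get("stats") or []:
        fixed = dict(stat)
        for name in REPEATED_STAT_FIELDS:
            # Replace a missing key or a None value with an empty list.
            fixed[name] = list(stat.get(name) or [])
        stats.append(fixed)
    return {**row, "stats": stats}


# Illustrative row: `a` is None and `h` is missing entirely.
row = {"code": 1, "event": 7, "stats": [{"identifier": "x", "a": None}]}
print(normalize_row(row))
```

The function copies rather than mutates, so the original dictionaries stay untouched if they are reused elsewhere.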
When I let BigQuery infer the schema from the Parquet file, additional list and element fields appear with RECORD REPEATED mode, even though they are not in my predefined PyArrow schema. However, since REPEATED RECORD fields can, as far as I know, contain null values, an empty list ([]) versus NULL shouldn't be the issue by itself. Could this indicate a problem with my PyArrow schema?
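For context, the extra list and element levels look like Parquet's standard three-level LIST encoding: a group annotated as LIST, containing a repeated group named list, which in turn holds a child named element. BigQuery's load configuration has a Parquet list-inference option (enable_list_inference) intended to collapse these levels, though I have not verified that it resolves this case. The sketch below only illustrates how the inferred field paths relate to the ones in my declared schema; strip_list_wrappers is a hypothetical helper, and it assumes no real field in the schema is itself named list.

```python
from typing import List


def strip_list_wrappers(path: str) -> str:
    """Drop each `list.element` pair that Parquet's LIST encoding inserts."""
    parts = path.split(".")
    kept: List[str] = []
    i = 0
    while i < len(parts):
        # A wrapper is the literal pair ("list", "element") after a field name.
        is_wrapper = (
            bool(kept)
            and parts[i] == "list"
            and i + 1 < len(parts)
            and parts[i + 1] == "element"
        )
        if is_wrapper:
            i += 2  # skip both wrapper levels
            continue
        kept.append(parts[i])
        i += 1
    return ".".join(kept)


inferred = "stats.list.element.h.list.element.value"
print(strip_list_wrappers(inferred))  # stats.h.value
```

A real field named element, such as stats.a.element, is preserved because it is not preceded by a list level.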
