I am working with the following Pydantic models, which are transformed into PyArrow tables before being written to Parquet files (in a GCS bucket).
Pydantic Schema:
from pydantic import BaseModel, Field
from typing import List


class StatEntry(BaseModel):
    value: int
    element: int


class Stat(BaseModel):
    identifier: str
    a: List[StatEntry] = Field(default_factory=list)
    h: List[StatEntry] = Field(default_factory=list)
PyArrow Schema Used to Generate the Parquet File:
import pyarrow as pa


@classmethod
def pyarrow_schema(cls):
    return pa.schema(
        [
            pa.field("code", pa.int32()),
            pa.field("event", pa.int32()),
            pa.field(
                "stats",
                pa.list_(
                    pa.struct(
                        [
                            pa.field("identifier", pa.string()),
                            pa.field(
                                "a",
                                pa.list_(
                                    pa.struct(
                                        [
                                            pa.field("value", pa.int32()),
                                            pa.field("element", pa.int32()),
                                        ]
                                    )
                                ),
                            ),
                            pa.field(
                                "h",
                                pa.list_(
                                    pa.struct(
                                        [
                                            pa.field("value", pa.int32()),
                                            pa.field("element", pa.int32()),
                                        ]
                                    )
                                ),
                            ),
                        ]
                    )
                ),
            ),
            pa.field("ingestion_time", pa.date32()),
        ]
    )
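For context, the table is built and written roughly like this (a minimal sketch: the top-level Event model, its field values, and the output path are assumptions, since only Stat and StatEntry are shown above; the real model is assumed to carry the pyarrow_schema() classmethod):

import datetime
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical top-level model wrapping the fields from the PyArrow schema.
class Event(BaseModel):
    code: int
    event: int
    stats: List[Stat] = Field(default_factory=list)
    ingestion_time: datetime.date

events = [
    Event(code=1, event=7, stats=[Stat(identifier="goals")], ingestion_time=datetime.date(2024, 8, 1)),
]
# model_dump() is Pydantic v2; use .dict() on v1.
records = [e.model_dump() for e in events]
table = pa.Table.from_pylist(records, schema=Event.pyarrow_schema())  # the classmethod above
pq.write_table(table, "events.parquet")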
The ultimate goal is to load these Parquet files into a BigQuery table with the following schema:
from google.cloud import bigquery

schema = [
    bigquery.SchemaField("code", "INTEGER"),
    bigquery.SchemaField("event", "INTEGER"),
    bigquery.SchemaField(
        "stats",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("identifier", "STRING"),
            bigquery.SchemaField(
                "a",
                "RECORD",
                mode="REPEATED",
                fields=[
                    bigquery.SchemaField("value", "INTEGER"),
                    bigquery.SchemaField("element", "INTEGER"),
                ],
            ),
            bigquery.SchemaField(
                "h",
                "RECORD",
                mode="REPEATED",
                fields=[
                    bigquery.SchemaField("value", "INTEGER"),
                    bigquery.SchemaField("element", "INTEGER"),
                ],
            ),
        ],
    ),
    bigquery.SchemaField("ingestion_time", "DATE"),
]
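The load itself looks roughly like this (a minimal sketch; the GCS URI and table ID are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    schema=schema,  # the explicit schema above, rather than autodetect
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.parquet",  # placeholder URI
    "my-project.my_dataset.events",     # placeholder table ID
    job_config=job_config,
)
load_job.result()  # block until the load finishes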
The field h is defined as pa.list_(pa.struct([...])), meaning it should always be an array of structs. Sometimes h (or the entire stats field) is an empty list, which I suspect BigQuery might misinterpret as a scalar rather than a repeated field.
In BigQuery, h is declared as REPEATED, meaning it must always be an array and can never be NULL. Is my understanding correct? If so, are there any workarounds?
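For what it's worth, Arrow itself keeps an empty list distinct from a null list, so the question is really how that distinction survives the Parquet-to-BigQuery round trip. A quick check:

import pyarrow as pa

entry = pa.struct([pa.field("value", pa.int32()), pa.field("element", pa.int32())])
arr = pa.array([[], None, [{"value": 1, "element": 2}]], type=pa.list_(entry))
print(arr.is_null())  # [false, true, false] -- empty list and null stay distinct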
When allowing BigQuery to infer the schema from Parquet, additional list and element fields get added with RECORD REPEATED mode, which are not in my predefined PyArrow schema. However, since REPEATED RECORD fields can contain null values, an empty list ([]) vs. NULL shouldn't be the issue by itself. Could this indicate a problem with my PyArrow schema?
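If it helps to diagnose: the extra list and element names are how the Parquet LIST logical type is physically encoded (a repeated list group wrapping an element field), and that layout can be inspected directly. My understanding (worth verifying) is that BigQuery only collapses this three-level layout into a plain REPEATED field when list inference is enabled on the load job:

import pyarrow.parquet as pq
from google.cloud import bigquery

# Arrow-level view vs. the low-level Parquet layout that BigQuery sees;
# the latter should show stats -> list (repeated) -> element (struct).
print(pq.read_schema("events.parquet"))
print(pq.ParquetFile("events.parquet").schema)

# Assumption: enable_list_inference makes the load job fold list/element
# back into a single REPEATED RECORD instead of surfacing them as fields.
parquet_options = bigquery.format_options.ParquetOptions()
parquet_options.enable_list_inference = True
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    parquet_options=parquet_options,
)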