I am working with the following Pydantic models, which are transformed into PyArrow tables before being written to Parquet files (in a GCS bucket).
Pydantic Schema:
from pydantic import BaseModel, Field
from typing import List


class StatEntry(BaseModel):
    value: int
    element: int


class Stat(BaseModel):
    identifier: str
    a: List[StatEntry] = Field(default_factory=list)
    h: List[StatEntry] = Field(default_factory=list)
PyArrow Schema Used to Generate the Parquet File:
import pyarrow as pa


@classmethod
def pyarrow_schema(cls):
    return pa.schema(
        [
            pa.field("code", pa.int32()),
            pa.field("event", pa.int32()),
            pa.field(
                "stats",
                pa.list_(
                    pa.struct(
                        [
                            pa.field("identifier", pa.string()),
                            pa.field(
                                "a",
                                pa.list_(
                                    pa.struct(
                                        [
                                            pa.field("value", pa.int32()),
                                            pa.field("element", pa.int32()),
                                        ]
                                    )
                                ),
                            ),
                            pa.field(
                                "h",
                                pa.list_(
                                    pa.struct(
                                        [
                                            pa.field("value", pa.int32()),
                                            pa.field("element", pa.int32()),
                                        ]
                                    )
                                ),
                            ),
                        ]
                    )
                ),
            ),
            pa.field("ingestion_time", pa.date32()),
        ]
    )
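For context, the table is built and written roughly like this (a minimal sketch: the top-level Event model, its field values, and the output path are assumptions, since only Stat and StatEntry are shown above; the real model is assumed to carry the pyarrow_schema() classmethod):

import datetime
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical top-level model wrapping the fields from the PyArrow schema.
class Event(BaseModel):
    code: int
    event: int
    stats: List[Stat] = Field(default_factory=list)
    ingestion_time: datetime.date

events = [
    Event(code=1, event=7, stats=[Stat(identifier="goals")], ingestion_time=datetime.date(2024, 8, 1)),
]
# model_dump() is Pydantic v2; use .dict() on v1.
records = [e.model_dump() for e in events]
table = pa.Table.from_pylist(records, schema=Event.pyarrow_schema())  # the classmethod above
pq.write_table(table, "events.parquet")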
The ultimate goal is to load these Parquet files into a BigQuery table with the following schema:
from google.cloud import bigquery

schema = [
    bigquery.SchemaField("code", "INTEGER"),
    bigquery.SchemaField("event", "INTEGER"),
    bigquery.SchemaField(
        "stats",
        "RECORD",
        mode="REPEATED",
        fields=[
            bigquery.SchemaField("identifier", "STRING"),
            bigquery.SchemaField(
                "a",
                "RECORD",
                mode="REPEATED",
                fields=[
                    bigquery.SchemaField("value", "INTEGER"),
                    bigquery.SchemaField("element", "INTEGER"),
                ],
            ),
            bigquery.SchemaField(
                "h",
                "RECORD",
                mode="REPEATED",
                fields=[
                    bigquery.SchemaField("value", "INTEGER"),
                    bigquery.SchemaField("element", "INTEGER"),
                ],
            ),
        ],
    ),
    bigquery.SchemaField("ingestion_time", "DATE"),
]
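The load itself looks roughly like this (a minimal sketch; the GCS URI and table ID are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    schema=schema,  # the explicit schema above, rather than autodetect
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/events/*.parquet",  # placeholder URI
    "my-project.my_dataset.events",     # placeholder table ID
    job_config=job_config,
)
load_job.result()  # block until the load finishes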
The field h is defined as pa.list_(pa.struct([...])), meaning it should always be an array of structs. Sometimes h (or the entire stats field) is an empty list, which I suspect BigQuery might misinterpret as a scalar rather than a repeated field.
In BigQuery, h is declared as REPEATED, meaning it must always be an array and can never be NULL. Is my understanding correct? If so, are there any workarounds?
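For what it's worth, Arrow itself keeps an empty list distinct from a null list, so the question is really how that distinction survives the Parquet-to-BigQuery round trip. A quick check:

import pyarrow as pa

entry = pa.struct([pa.field("value", pa.int32()), pa.field("element", pa.int32())])
arr = pa.array([[], None, [{"value": 1, "element": 2}]], type=pa.list_(entry))
print(arr.is_null())  # [false, true, false] -- empty list and null stay distinct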
When allowing BigQuery to infer the schema from Parquet, additional list and element fields get added with RECORD REPEATED mode, which are not in my predefined PyArrow schema. However, since REPEATED RECORD fields can contain null values, an empty list ([]) vs. NULL shouldn't be the issue by itself. Could this indicate a problem with my PyArrow schema?
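If it helps to diagnose: the extra list and element names are how the Parquet LIST logical type is physically encoded (a repeated list group wrapping an element field), and that layout can be inspected directly. My understanding (worth verifying) is that BigQuery only collapses this three-level layout into a plain REPEATED field when list inference is enabled on the load job:

import pyarrow.parquet as pq
from google.cloud import bigquery

# Arrow-level view vs. the low-level Parquet layout that BigQuery sees;
# the latter should show stats -> list (repeated) -> element (struct).
print(pq.read_schema("events.parquet"))
print(pq.ParquetFile("events.parquet").schema)

# Assumption: enable_list_inference makes the load job fold list/element
# back into a single REPEATED RECORD instead of surfacing them as fields.
parquet_options = bigquery.format_options.ParquetOptions()
parquet_options.enable_list_inference = True
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    parquet_options=parquet_options,
)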