I have a JSON file that looks like this:
[
  { "student_id": 1234,
    "room_id": "abc",
    "enrolled": false
  },
  { "student_id": 4321,
    "room_id": "def",
    "enrolled": true,
    "enrollment": {
      "type": "home",
      "date": "01-01-2020"
    }
  },
  { "student_id": 678,
    "room_id": "htf",
    "sports": {
      "team": "hockey",
      "position": "forward"
    }
  }
]
I was able to flatten it partially by doing:
df = spark.read.json(sc.parallelize(data).map(lambda x: json.dumps(x)))
This gives me:
student_id | room_id | enrolled | enrollment | sports |
---|---|---|---|---|
1234 | abc | false | NULL | NULL |
4321 | def | true | {home, 01-01-2020} | NULL |
678 | htf | NULL | NULL | {hockey, forward} |
How can I flatten this further to get:
student_id | room_id | enrolled | type | date | team | position |
---|---|---|---|---|---|---|
1234 | abc | false | NULL | NULL | NULL | NULL |
4321 | def | true | home | 01-01-2020 | NULL | NULL |
678 | htf | NULL | NULL | NULL | hockey | forward |
1 Answer
You can use pyspark.SparkContext.parallelize to feed the JSON string into spark.read.json for parsing. In your case I would also define the schema explicitly, as shown next:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, BooleanType
# Define the schema; the nested structs are nullable so records may omit them
schema = StructType([
    StructField("student_id", IntegerType(), nullable=False),
    StructField("room_id", StringType(), nullable=False),
    StructField("enrolled", BooleanType(), nullable=True),
    StructField("enrollment", StructType([
        StructField("type", StringType(), nullable=True),
        StructField("date", StringType(), nullable=True)
    ]), nullable=True),
    StructField("sports", StructType([
        StructField("team", StringType(), nullable=True),
        StructField("position", StringType(), nullable=True)
    ]), nullable=True)
])
json_data = '''
[
  { "student_id": 1234,
    "room_id": "abc",
    "enrolled": false
  },
  { "student_id": 4321,
    "room_id": "def",
    "enrolled": true,
    "enrollment": {
      "type": "home",
      "date": "01-01-2020"
    }
  },
  { "student_id": 678,
    "room_id": "htf",
    "sports": {
      "team": "hockey",
      "position": "forward"
    }
  }
]
'''
df = spark.read.schema(schema).json(spark.sparkContext.parallelize([json_data]))
(
    df.selectExpr("student_id",
                  "room_id",
                  "enrolled",
                  "enrollment.type as type",
                  "enrollment.date as date",
                  "sports.team as team",
                  "sports.position as position")
    .show()
)
The explicit schema definition handles the optional fields enrollment and sports, which may or may not appear in a given record: when a struct is absent, the column still exists and simply comes back as null.
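For contrast, here is a minimal sketch (not part of the original answer) of what can go wrong without the explicit schema, assuming a hypothetical batch in which no record carries sports: schema inference sees no such field, so the column never exists and the selectExpr above would fail.

# Hypothetical batch where no record has a "sports" struct.
batch = '[{"student_id": 1234, "room_id": "abc", "enrolled": false}]'

# The inferred schema contains no "sports" column at all.
inferred_df = spark.read.json(spark.sparkContext.parallelize([batch]))
inferred_df.printSchema()  # only enrolled, room_id, student_id
# inferred_df.selectExpr("sports.team")  # would raise AnalysisException

# With the explicit schema the column always exists and is simply null.
typed_df = spark.read.schema(schema).json(spark.sparkContext.parallelize([batch]))
typed_df.selectExpr("student_id", "sports.team as team").show()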
Output of the selectExpr above:
+----------+-------+--------+----+----------+------+--------+
|student_id|room_id|enrolled|type| date| team|position|
+----------+-------+--------+----+----------+------+--------+
| 1234| abc| false|null| null| null| null|
| 4321| def| true|home|01-01-2020| null| null|
| 678| htf| null|null| null|hockey| forward|
+----------+-------+--------+----+----------+------+--------+
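If the data had more nested structs, the select list would not need to be written by hand. The helper below is a sketch (the name flatten_structs is made up for illustration) that walks df.schema and expands each top-level struct column, assuming one level of nesting and no duplicate leaf names:

from pyspark.sql.functions import col
from pyspark.sql.types import StructType

def flatten_structs(df):
    # Expand each top-level struct column into its leaf fields;
    # apply repeatedly for structs nested more than one level deep.
    cols = []
    for field in df.schema.fields:
        if isinstance(field.dataType, StructType):
            for nested in field.dataType.fields:
                cols.append(col(f"{field.name}.{nested.name}").alias(nested.name))
        else:
            cols.append(col(field.name))
    return df.select(cols)

flatten_structs(df).show()  # same output as the selectExpr version above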
Would pyspark.pandas.json_normalize() do what you want? – Barmar Commented Mar 5 at 23:53
df.select("student_id", "room_id", "enrolled", "enrollment.*", "sports.*")? – BeRT2me Commented Mar 7 at 19:45
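For what it's worth, BeRT2me's star-expansion suggestion gives the same flattened result here, assuming the schema-parsed df from the answer; it works because the nested field names (type, date, team, position) do not collide.

# "enrollment.*" and "sports.*" each expand one struct level.
df.select("student_id", "room_id", "enrolled", "enrollment.*", "sports.*").show()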