I have a JSON file that looks like this:
[
  { "student_id": 1234,
    "room_id": "abc",
    "enrolled": false
  },
  { "student_id": 4321,
    "room_id": "def",
    "enrolled": true,
    "enrollment": {
      "type": "home",
      "date": "01-01-2020"
    }
  },
  { "student_id": 678,
    "room_id": "htf",
    "sports": {
      "team": "hockey",
      "position": "forward"
    }
  }
]
I was able to flatten it partially by doing:
df = spark.read.json(sc.parallelize(data).map(lambda x: json.dumps(x)))
This gives me:
student_id | room_id | enrolled | enrollment | sports |
---|---|---|---|---|
1234 | abc | false | NULL | NULL |
4321 | def | true | {home, 01-01-2020} | NULL |
678 | htf | NULL | NULL | {hockey, forward} |
How can I flatten this further to get:
student_id | room_id | enrolled | type | date | team | position |
---|---|---|---|---|---|---|
1234 | abc | false | NULL | NULL | NULL | NULL |
4321 | def | true | home | 01-01-2020 | NULL | NULL |
678 | htf | NULL | NULL | NULL | hockey | forward |
1 Answer
You can use pyspark.SparkContext.parallelize to feed the JSON string into spark.read.json for parsing. In your case I would also define the schema explicitly, as shown next:
from pyspark.sql.types import StructType, StructField, IntegerType, StringType, BooleanType
# Define the schema; the nested structs are nullable so records may omit them
schema = StructType([
    StructField("student_id", IntegerType(), nullable=False),
    StructField("room_id", StringType(), nullable=False),
    StructField("enrolled", BooleanType(), nullable=True),
    StructField("enrollment", StructType([
        StructField("type", StringType(), nullable=True),
        StructField("date", StringType(), nullable=True)
    ]), nullable=True),
    StructField("sports", StructType([
        StructField("team", StringType(), nullable=True),
        StructField("position", StringType(), nullable=True)
    ]), nullable=True)
])
json_data = '''
[
  { "student_id": 1234,
    "room_id": "abc",
    "enrolled": false
  },
  { "student_id": 4321,
    "room_id": "def",
    "enrolled": true,
    "enrollment": {
      "type": "home",
      "date": "01-01-2020"
    }
  },
  { "student_id": 678,
    "room_id": "htf",
    "sports": {
      "team": "hockey",
      "position": "forward"
    }
  }
]
'''
df = spark.read.schema(schema).json(spark.sparkContext.parallelize([json_data]))
(
    df.selectExpr("student_id",
                  "room_id",
                  "enrolled",
                  "enrollment.type as type",
                  "enrollment.date as date",
                  "sports.team as team",
                  "sports.position as position")
    .show()
)
The explicit schema definition handles the optional fields enrollment and sports, which may or may not appear in a given record: when a struct is absent, the column still exists and simply comes back as null.
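For contrast, here is a minimal sketch (not part of the original answer) of what can go wrong without the explicit schema, assuming a hypothetical batch in which no record carries sports: schema inference sees no such field, so the column never exists and the selectExpr above would fail.

# Hypothetical batch where no record has a "sports" struct.
batch = '[{"student_id": 1234, "room_id": "abc", "enrolled": false}]'

# The inferred schema contains no "sports" column at all.
inferred_df = spark.read.json(spark.sparkContext.parallelize([batch]))
inferred_df.printSchema()  # only enrolled, room_id, student_id
# inferred_df.selectExpr("sports.team")  # would raise AnalysisException

# With the explicit schema the column always exists and is simply null.
typed_df = spark.read.schema(schema).json(spark.sparkContext.parallelize([batch]))
typed_df.selectExpr("student_id", "sports.team as team").show()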
Output of the selectExpr above:
+----------+-------+--------+----+----------+------+--------+
|student_id|room_id|enrolled|type| date| team|position|
+----------+-------+--------+----+----------+------+--------+
| 1234| abc| false|null| null| null| null|
| 4321| def| true|home|01-01-2020| null| null|
| 678| htf| null|null| null|hockey| forward|
+----------+-------+--------+----+----------+------+--------+
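If the data had more nested structs, the select list would not need to be written by hand. The helper below is a sketch (the name flatten_structs is made up for illustration) that walks df.schema and expands each top-level struct column, assuming one level of nesting and no duplicate leaf names:

from pyspark.sql.functions import col
from pyspark.sql.types import StructType

def flatten_structs(df):
    # Expand each top-level struct column into its leaf fields;
    # apply repeatedly for structs nested more than one level deep.
    cols = []
    for field in df.schema.fields:
        if isinstance(field.dataType, StructType):
            for nested in field.dataType.fields:
                cols.append(col(f"{field.name}.{nested.name}").alias(nested.name))
        else:
            cols.append(col(field.name))
    return df.select(cols)

flatten_structs(df).show()  # same output as the selectExpr version above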
Would pyspark.pandas.json_normalize() do what you want? – Barmar Commented Mar 5 at 23:53
df.select("student_id", "room_id", "enrolled", "enrollment.*", "sports.*")? – BeRT2me Commented Mar 7 at 19:45
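For what it's worth, BeRT2me's star-expansion suggestion gives the same flattened result here, assuming the schema-parsed df from the answer; it works because the nested field names (type, date, team, position) do not collide.

# "enrollment.*" and "sports.*" each expand one struct level.
df.select("student_id", "room_id", "enrolled", "enrollment.*", "sports.*").show()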