
python - How to flatten nested JSON in pyspark - Stack Overflow


I have a JSON file that looks like this:

[
 {  "student_id": 1234, 
    "room_id": "abc", 
    "enrolled": false
 }, 
 {   "student_id": 4321, 
    "room_id": "def", 
    "enrolled": true, 
    "enrollment": {
        "type": "home", 
        "date": "01-01-2020"
    }
},
 {   "student_id": 678,
    "room_id": "htf",
    "sports": {
        "team": "hockey",
        "position": "forward"
    }
 }
] 

I am able to flatten it partially by doing:

df = spark.read.json(sc.parallelize(data).map(lambda x: json.dumps(x)))

This gives me:

student_id room_id enrolled enrollment sports
1234 abc false NULL NULL
4321 def true {home, 01-01-2020} NULL
678 htf NULL NULL {hockey, forward}

How can I flatten this further to get:

student_id room_id enrolled type date team position
1234 abc false NULL NULL NULL NULL
4321 def true home 01-01-2020 NULL NULL
678 htf NULL NULL NULL hockey forward
Asked Mar 5 at 22:35 by unlocknew; edited Mar 5 at 23:38 by desertnaut.
  • 2 Does pyspark.pandas.json_normalize() do what you want? – Barmar Commented Mar 5 at 23:53
  • This would work but I am trying to do it without using pandas – unlocknew Commented Mar 6 at 17:56
  • df.select("student_id", "room_id", "enrolled", "enrollment.*", "sports.*")? – BeRT2me Commented Mar 7 at 19:45
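Since the question passes a list of Python dicts to `sc.parallelize`, one pandas-free option is to flatten each record in plain Python before Spark sees it. The sketch below is only an illustration under that assumption: `flatten_record`, `flat_rows`, `columns` and `rows` are hypothetical names, it lifts exactly one level of nesting, and a nested key that collides with a top-level key would silently overwrite it.

```python
def flatten_record(record):
    """Return a copy of `record` with one level of nested dicts lifted to the top."""
    flat = {}
    for key, value in record.items():
        if isinstance(value, dict):
            # Lift the nested fields; the parent key ("enrollment", "sports") is dropped.
            flat.update(value)
        else:
            flat[key] = value
    return flat


# The records from the question, written as Python literals.
data = [
    {"student_id": 1234, "room_id": "abc", "enrolled": False},
    {"student_id": 4321, "room_id": "def", "enrolled": True,
     "enrollment": {"type": "home", "date": "01-01-2020"}},
    {"student_id": 678, "room_id": "htf",
     "sports": {"team": "hockey", "position": "forward"}},
]

flat_rows = [flatten_record(r) for r in data]

# Fix the column order and fill missing fields with None so every row has the same shape.
columns = ["student_id", "room_id", "enrolled", "type", "date", "team", "position"]
rows = [tuple(r.get(c) for c in columns) for r in flat_rows]
```

The resulting `rows` and `columns` could then be handed to `spark.createDataFrame`; passing an explicit schema there avoids relying on type inference for columns that are mostly `None`.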

1 Answer


You can use pyspark.SparkContext.parallelize to turn the JSON string into an RDD and let spark.read.json parse it. In your case I would also define the schema explicitly, as shown next:

from pyspark.sql.types import StructType, StructField, IntegerType, StringType, BooleanType

# Define the schema
schema = StructType([
    StructField("student_id", IntegerType(), nullable=False),
    StructField("room_id", StringType(), nullable=False),
    StructField("enrolled", BooleanType(), nullable=True),
    StructField("enrollment", StructType([
        StructField("type", StringType(), nullable=True),
        StructField("date", StringType(), nullable=True)
    ]), nullable=True),
    StructField("sports", StructType([
        StructField("team", StringType(), nullable=True),
        StructField("position", StringType(), nullable=True)
    ]), nullable=True)
])

json = '''
[
 {  "student_id": 1234, 
    "room_id": "abc", 
    "enrolled": false
 }, 
 {   "student_id": 4321, 
    "room_id": "def", 
    "enrolled": true, 
    "enrollment": {
        "type": "home", 
        "date": "01-01-2020"
    }
  },
 {  "student_id": 678,
    "room_id": "htf",
    "sports": {
        "team": "hockey",
        "position": "forward"
    }
  }
] 
'''

# Each RDD element is parsed as one JSON document; the top-level array yields one row per object
df = spark.read.schema(schema).json(spark.sparkContext.parallelize([json]))
(
  df.selectExpr("student_id",   
              "room_id",    
              "enrolled", 
              "enrollment.type as type",    
              "enrollment.date as date",
              "sports.team as team",
              "sports.position as position")
  .show()
)

Defining the schema explicitly guarantees that the optional fields enrollment and sports exist as struct columns (null where a record omits them), even when a given input contains no record with them.

Output:

+----------+-------+--------+----+----------+------+--------+
|student_id|room_id|enrolled|type|      date|  team|position|
+----------+-------+--------+----+----------+------+--------+
|      1234|    abc|   false|null|      null|  null|    null|
|      4321|    def|    true|home|01-01-2020|  null|    null|
|       678|    htf|    null|null|      null|hockey| forward|
+----------+-------+--------+----+----------+------+--------+
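If the set of nested fields is not known up front, the list passed to `selectExpr` could be derived from the schema instead of typed by hand. The helper below is a minimal sketch, not part of the answer: `flat_select_exprs` is a hypothetical name, it handles only one level of struct nesting, and it assumes the nested field names do not collide with each other or with top-level columns. It relies only on the `.fields`, `.name` and `.dataType` attributes that `pyspark.sql.types.StructType` and `StructField` expose.

```python
def flat_select_exprs(schema):
    """Build selectExpr strings that lift one level of struct fields to the top.

    `schema` needs a `.fields` list whose items have `.name` and `.dataType`,
    which is the shape of pyspark.sql.types.StructType.
    """
    exprs = []
    for field in schema.fields:
        # Only struct types carry a `.fields` attribute; scalar types do not.
        nested = getattr(field.dataType, "fields", None)
        if nested is None:
            exprs.append(field.name)
        else:
            exprs.extend(
                f"{field.name}.{child.name} AS {child.name}" for child in nested
            )
    return exprs


# Assumed usage with the DataFrame from the answer:
# df.selectExpr(*flat_select_exprs(df.schema)).show()
```

For a fixed, known schema, the shorter form from the comments, `df.select("student_id", "room_id", "enrolled", "enrollment.*", "sports.*")`, expands the struct columns in the same way.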