
gzip - Partial records being read in Pyspark through Dataproc - Stack Overflow


I have a Google Dataproc job that reads a CSV file from Google Cloud Storage. The object has the following metadata headers:

Content-Type: application/octet-stream

Content-Encoding: gzip

File name: gs://test_bucket/sample.txt (the file doesn't have a .gz extension, but it is gzip-compressed)
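
For reference, a minimal sketch of how this metadata could be checked with the google-cloud-storage Python client (the client call is an assumption for illustration, not part of the original job; bucket and object names are taken from the path above):

    from google.cloud import storage

    # Fetch the object and print its stored metadata (assumed bucket/object names).
    client = storage.Client()
    blob = client.bucket("test_bucket").get_blob("sample.txt")
    print(blob.content_type)      # expected: application/octet-stream
    print(blob.content_encoding)  # expected: gzip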

The code below runs successfully, but the DataFrame record count (9k) does not match the file's record count (100k). It looks like only the first 9k rows are being read. How do I make sure all the rows are read into my DataFrame?

    self.spark: SparkSession = (
        SparkSession.builder.appName("app_name")
        .config("spark.executor.memory", "4g")
        .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
        .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
        .config("spark.hadoop.fs.gs.inputstream.support.gzip.encoding.enable", "true")
        .config("spark.sql.legacy.timeParserPolicy", "CORRECTED")
        .config("spark.driver.memory", "4g")
        .getOrCreate()
    )

    df = (self.spark.read.format("csv")
        .schema(schema)
        .option("mode", "PERMISSIVE")
        .option("encoding", "UTF-8")
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .load(self.file_path))

    print("df total count: ", df.count())