
gzip - Partial records being read in Pyspark through Dataproc - Stack Overflow


I have a Google Dataproc job that reads a CSV file from Google Cloud Storage. The object has the following metadata headers:

Content-Type: application/octet-stream

Content-Encoding: gzip

File name: gs://test_bucket/sample.txt (the file doesn't have a .gz extension, but it is gzip-compressed)
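
For reference, a minimal sketch of how this metadata could be checked with the google-cloud-storage Python client (the client call is an assumption for illustration, not part of the original job; bucket and object names are taken from the path above):

    from google.cloud import storage

    # Fetch the object and print its stored metadata (assumed bucket/object names).
    client = storage.Client()
    blob = client.bucket("test_bucket").get_blob("sample.txt")
    print(blob.content_type)      # expected: application/octet-stream
    print(blob.content_encoding)  # expected: gzip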

The code below runs successfully, but the DataFrame record count (9k) does not match the file's record count (100k). It looks like only the first 9k rows are being read. How do I make sure all the rows are read into my DataFrame?

    self.spark: SparkSession = (
        SparkSession.builder.appName("app_name")
        .config("spark.executor.memory", "4g")
        .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
        .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
        .config("spark.hadoop.fs.gs.inputstream.support.gzip.encoding.enable", "true")
        .config("spark.sql.legacy.timeParserPolicy", "CORRECTED")
        .config("spark.driver.memory", "4g")
        .getOrCreate()
    )

    df = (self.spark.read.format("csv")
        .schema(schema)
        .option("mode", "PERMISSIVE")
        .option("encoding", "UTF-8")
        .option("columnNameOfCorruptRecord", "_corrupt_record")
        .load(self.file_path))

    print("df total count: ", df.count())