I have a Google Dataproc job that reads a CSV file from Google Cloud Storage. The GCS object has the following metadata:
Content-Type: application/octet-stream
Content-Encoding: gzip
File name: gs://test_bucket/sample.txt (the object doesn't have a .gz extension, but its content is gzip-compressed)
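For reference, the object metadata can be confirmed with the google-cloud-storage Python client (a minimal sketch; the bucket and object names are taken from the path above):

import gzip  # not needed for this check, only for the count sketch further down
from google.cloud import storage  # assumes google-cloud-storage is installed

# Sketch: print the metadata GCS reports for this object.
client = storage.Client()
blob = client.bucket("test_bucket").get_blob("sample.txt")
print(blob.content_type)      # application/octet-stream
print(blob.content_encoding)  # gzip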
The code below runs successfully, but the DataFrame record count (9k) does not match the file's record count (100k). It looks like only the first 9k rows are being read. How do I make sure all of the rows end up in my DataFrame?
self.spark: SparkSession = (
    SparkSession.builder.appName("app_name")
    .config("spark.executor.memory", "4g")
    .config("spark.hadoop.fs.gs.impl", "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem")
    .config("spark.hadoop.google.cloud.auth.service.account.enable", "true")
    .config("spark.hadoop.fs.gs.inputstream.support.gzip.encoding.enable", "true")
    .config("spark.sql.legacy.timeParserPolicy", "CORRECTED")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)
df = (self.spark.read.format("csv")
      .schema(schema)
      .option("mode", "PERMISSIVE")
      .option("encoding", "UTF-8")
      .option("columnNameOfCorruptRecord", "_corrupt_record")
      .load(self.file_path))
print("df total count: ", df.count())