In Jupyter notebooks I succeed in reading Parquet files in HDFS thanks to sparkmagic. My sparkmagic config starts with:
```json
{
    "kernel_python_credentials" : {
        "username": "admin",
        "password": "abcd",
        "url": "https://test.x.knox.y.fr:8443/gateway/cdp-proxy-api/livy",
        "auth": "Basic_Access"
    }
    ...
}
```
I now want to work in VS Code with Python scripts instead of notebooks. How can I read and write Parquet files in HDFS when I run scripts in VS Code?

I tried to set up the Spark config, but it didn't work. I have a main.py script that runs a package and starts with:
```python
import os
import yaml
from pyspark.sql import SparkSession

def main():
    spark = (
        SparkSession.builder
        .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
        .getOrCreate()
    )
    spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
    spark.conf.set("spark.yarn.maxAttempts", 0)
    Class_1(...)

if __name__ == "__main__":
    main()
```
When I call read.parquet I get the error:

```
[PATH_NOT_FOUND] Path does not exist: file:/var/projects/test/package/projects/DEV/parquet_folder/
```

/var/projects/test/package/ is where my Python package is cloned. In Jupyter notebooks,

```python
spark.read.parquet('projects/DEV/parquet_folder/')
```

works.
1 Answer
The path `file:/var/projects/test/package/projects/DEV/parquet_folder/` appears to be incorrect. When using the `file://` prefix, you should include three slashes (`file:///`) to indicate an absolute path:
```
file:///var/projects/test/package/projects/DEV/parquet_folder/
```
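For example, a read against the local filesystem would then look like this (a minimal sketch, reusing the `spark` session from the question's main.py):

```python
# Explicit file:/// URI forces the local filesystem,
# regardless of what fs.defaultFS is set to.
df = spark.read.parquet("file:///var/projects/test/package/projects/DEV/parquet_folder/")
```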
Otherwise, just use the path without a prefix, /var/projects/test/package/projects/DEV/parquet_folder/, and let Spark look for the file/folder in the default filesystem (HDFS, S3, local filesystem) as configured in [core-site.xml](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml) (property fs.defaultFS, which is the local filesystem by default).
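Since your script runs outside the Jupyter/Livy setup, fs.defaultFS probably points at the local filesystem rather than HDFS, which explains the file:/ prefix in the error. A minimal sketch of forcing the default filesystem to HDFS when building the session (the namenode address hdfs://namenode.example.com:8020 is a placeholder, not from your post; use your cluster's actual fs.defaultFS value):

```python
from pyspark.sql import SparkSession

# spark.hadoop.* properties are passed through to the Hadoop Configuration,
# so this overrides fs.defaultFS for this session.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.defaultFS", "hdfs://namenode.example.com:8020")  # hypothetical namenode
    .getOrCreate()
)

# A relative path now resolves against your HDFS home directory,
# matching the behaviour you saw in the Jupyter notebooks.
df = spark.read.parquet("projects/DEV/parquet_folder/")
```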