In Jupyter notebooks I succeed in reading Parquet files in HDFS thanks to sparkmagic. My sparkmagic config starts with:
```json
{
    "kernel_python_credentials" : {
        "username": "admin",
        "password": "abcd",
        "url": "https://test.x.knox.y.fr:8443/gateway/cdp-proxy-api/livy",
        "auth": "Basic_Access"
    }
    ...
}
```
I now want to work in VS Code with Python scripts instead of notebooks. How can I read and write Parquet files in HDFS when I run scripts in VS Code?

I tried to set up the Spark config, but it didn't work. I have a main.py script that runs a package and starts with:
```python
import os
import yaml
from pyspark.sql import SparkSession

def main():
    spark = (
        SparkSession.builder
        .config("spark.sql.legacy.timeParserPolicy", "LEGACY")
        .getOrCreate()
    )
    spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "LEGACY")
    spark.conf.set("spark.yarn.maxAttempts", 0)
    Class_1(...)

if __name__ == "__main__":
    main()
```
When I call read.parquet I get the error:

```
[PATH_NOT_FOUND] Path does not exist: file:/var/projects/test/package/projects/DEV/parquet_folder/
```

/var/projects/test/package/ is where my Python package is cloned. In Jupyter notebooks,

```python
spark.read.parquet('projects/DEV/parquet_folder/')
```

works.
1 Answer
The path `file:/var/projects/test/package/projects/DEV/parquet_folder/` appears to be incorrect. When using the `file://` prefix, you should include three slashes (`file:///`) to indicate an absolute path:
```
file:///var/projects/test/package/projects/DEV/parquet_folder/
```
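For example, a read against the local filesystem would then look like this (a minimal sketch, reusing the `spark` session from the question's main.py):

```python
# Explicit file:/// URI forces the local filesystem,
# regardless of what fs.defaultFS is set to.
df = spark.read.parquet("file:///var/projects/test/package/projects/DEV/parquet_folder/")
```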
Otherwise, just use the path without a prefix, /var/projects/test/package/projects/DEV/parquet_folder/, and let Spark look for the file/folder in the default filesystem (HDFS, S3, local filesystem) as configured in [core-site.xml](https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-common/core-default.xml) (property fs.defaultFS, which is the local filesystem by default).
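Since your script runs outside the Jupyter/Livy setup, fs.defaultFS probably points at the local filesystem rather than HDFS, which explains the file:/ prefix in the error. A minimal sketch of forcing the default filesystem to HDFS when building the session (the namenode address hdfs://namenode.example.com:8020 is a placeholder, not from your post; use your cluster's actual fs.defaultFS value):

```python
from pyspark.sql import SparkSession

# spark.hadoop.* properties are passed through to the Hadoop Configuration,
# so this overrides fs.defaultFS for this session.
spark = (
    SparkSession.builder
    .config("spark.hadoop.fs.defaultFS", "hdfs://namenode.example.com:8020")  # hypothetical namenode
    .getOrCreate()
)

# A relative path now resolves against your HDFS home directory,
# matching the behaviour you saw in the Jupyter notebooks.
df = spark.read.parquet("projects/DEV/parquet_folder/")
```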