Hello, I have a question about web scraping: how do I read the response in PySpark? Here's my code:
import requests
import pyspark
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
r = requests.get('https://www.skysports/football-scores-fixtures')
spark=SparkSession.builder.getOrCreate()
df=spark.read.text(r.content)
I think I am doing it wrong, so how can I read it with PySpark?
Asked Feb 16 at 10:47 by Bahy Mohamed
Comment: This question is similar to: Manually create a pyspark dataframe. If you believe it's different, please edit the question, make it clear how it's different and/or how the answers on that question are not helpful for your problem. – Steven, Feb 17 at 11:40
2 Answers
r.content is raw bytes, and spark.read.text() expects a file path (or a list of paths), not the HTML content itself.
Use r.text to get the response as a string, then build the DataFrame from that string:
import requests
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("WebScraping").getOrCreate()
r = requests.get('https://www.skysports/football-scores-fixtures')
rdd = spark.sparkContext.parallelize([r.text])
df = spark.createDataFrame(rdd, "string").toDF("html_content")
df.show(truncate=False)
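The code above puts the whole page into a single row. If the goal is closer to what spark.read.text() does for a file (one row per line), a minimal sketch could split the response text first. The helper names html_to_line_rows and lines_to_dataframe are made up for illustration, and the sketch assumes a SparkSession already exists:

```python
def html_to_line_rows(html):
    """Split page text into one single-field row per non-empty line."""
    return [(line,) for line in html.splitlines() if line.strip()]


def lines_to_dataframe(spark, html):
    """Build a DataFrame with a single 'value' column, one row per line.

    `spark` is assumed to be an existing SparkSession.
    """
    return spark.createDataFrame(html_to_line_rows(html), "value string")


# Usage (assumes `spark` and the `r` response from the code above):
# df = lines_to_dataframe(spark, r.text)
# df.show(truncate=False)
```

The column is named value only to mirror what spark.read.text() produces; any name works.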
You can't ingest this website directly in PySpark. You need to parse the HTML first, using a library like BeautifulSoup.
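As a rough sketch of the "parse first, then load" idea: the example below pulls every link out of the page into plain Python tuples and only then hands them to Spark. It uses the standard library's html.parser in place of BeautifulSoup so that it runs without extra installs; the approach is the same either way. The names LinkCollector, extract_links and links_to_dataframe are hypothetical, and extracting links is just an example of a field you might want from the page.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects (href, text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.rows = []      # finished (href, text) tuples
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # text fragments seen inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an open link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.rows.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html):
    """Return a list of (href, text) tuples found in the HTML string."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


def links_to_dataframe(spark, html):
    """Load the parsed links into Spark; `spark` is an existing SparkSession."""
    return spark.createDataFrame(extract_links(html), "href string, text string")


# Usage (assumes `spark` and a requests response `r`):
# links_to_dataframe(spark, r.text).show(truncate=False)
```

Parsing on the driver like this is fine for a single page; the resulting DataFrame has proper columns instead of one blob of HTML.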
But if you can't get an API, you can ingest the page directly using PySpark like this: