python - How to do web scraping using pyspark

Hello, I have a question: how do I do web scraping and read the response in PySpark? Here's my code:

import requests
import pyspark
from pyspark.sql.functions import *
from pyspark.sql import SparkSession

r = requests.get('https://www.skysports/football-scores-fixtures')
spark=SparkSession.builder.getOrCreate()
df=spark.read.text(r.content)

But I think I am doing it wrong. How can I read the response with PySpark?

asked Feb 16 at 10:47 by Bahy Mohamed

This question is similar to: Manually create a pyspark dataframe. If you believe it’s different, please edit the question, make it clear how it’s different and/or how the answers on that question are not helpful for your problem. – Steven Commented Feb 17 at 11:40


2 Answers


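r.content is raw bytes, and spark.read.text() expects one or more file paths, not the HTML content itself.

Fetch the page on the driver with requests, then convert the response into something PySpark can load, for example a one-row DataFrame that holds the decoded text: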

import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WebScraping").getOrCreate()

# Fetch the page on the driver; r.text is the decoded HTML as a str
r = requests.get('https://www.skysports/football-scores-fixtures')

# Wrap the HTML in a one-element RDD, then build a single-column DataFrame
rdd = spark.sparkContext.parallelize([r.text])
df = spark.createDataFrame(rdd, "string").toDF("html_content")

df.show(truncate=False)
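If one row per line is more useful than a single row holding the whole page (closer to what spark.read.text() would produce for a file), the response can be split before it reaches Spark. This is only a minimal sketch: `html_to_rows` and `rows_to_dataframe` are hypothetical helper names, and the column names are an arbitrary choice.

```python
from typing import List, Tuple


def html_to_rows(html: str) -> List[Tuple[int, str]]:
    """Split fetched HTML into (line_number, line) rows, skipping blank lines."""
    rows = []
    for number, line in enumerate(html.splitlines(), start=1):
        stripped = line.strip()
        if stripped:
            rows.append((number, stripped))
    return rows


def rows_to_dataframe(spark, rows):
    """Build a two-column DataFrame from the rows.

    `spark` is an existing SparkSession. The explicit DDL schema string
    avoids schema inference, so an empty page does not raise an error.
    """
    return spark.createDataFrame(rows, "line_number int, line string")
```

With the `spark` session and response `r` from the answer above, `rows_to_dataframe(spark, html_to_rows(r.text)).show(5, truncate=False)` would display the first five non-blank lines of the page.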

You can't ingest this website directly with PySpark. You first need to parse the HTML with a library such as BeautifulSoup.
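The parsing step might look roughly like the sketch below. To stay dependency-free it uses the standard library's `html.parser` rather than BeautifulSoup, which would do the same job with less code. Extracting links is only an illustrative choice, since the structure of the target page is not shown here; `LinkExtractor`, `extract_links` and `links_to_dataframe` are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List, Optional, Tuple


class LinkExtractor(HTMLParser):
    """Collect (href, text) pairs for every <a href=...> element."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href: Optional[str] = None  # href of the anchor being read
        self._text: List[str] = []        # text pieces inside that anchor

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self._href = href
                self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside an anchor with an href
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html: str) -> List[Tuple[str, str]]:
    """Parse the HTML on the driver and return plain Python tuples."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


def links_to_dataframe(spark, links):
    """Load the parsed rows into Spark; `spark` is an existing SparkSession."""
    return spark.createDataFrame(links, "href string, text string")
```

The point of the design is that parsing happens in plain Python first, so Spark only ever receives structured rows rather than raw HTML.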

But if you can't get the API, you can ingest the page directly using PySpark like this:
