
python - How to do web scraping using pyspark - Stack Overflow


Hello, I have a question about how to do web scraping and read the response in PySpark. Here's my code:

import requests
import pyspark
from pyspark.sql.functions import *
from pyspark.sql import SparkSession

r = requests.get('https://www.skysports.com/football-scores-fixtures')
spark=SparkSession.builder.getOrCreate()
df=spark.read.text(r.content)

but I think I am doing it wrong, so how can I read it with PySpark?

asked Feb 16 at 10:47 by Bahy Mohamed
  • This question is similar to: Manually create a pyspark dataframe. If you believe it’s different, please edit the question, make it clear how it’s different and/or how the answers on that question are not helpful for your problem. – Steven Commented Feb 17 at 11:40

2 Answers


r.content is binary data, and spark.read.text() expects a file path, not raw HTML content.
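To illustrate why the original call fails: spark.read.text() reads from storage, so the response would have to hit disk first. A minimal sketch, where the path /tmp/page.html is a hypothetical choice and a local Spark session (driver and executors sharing a filesystem) is assumed:

import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

r = requests.get('https://www.skysports.com/football-scores-fixtures')

# Write the decoded page to disk; spark.read.text() then loads it with
# each line of the file becoming one row in a single "value" column.
with open('/tmp/page.html', 'w', encoding='utf-8') as f:
    f.write(r.text)

df = spark.read.text('/tmp/page.html')
df.show(5, truncate=False)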

You should convert the response to a format that PySpark can handle.

import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WebScraping").getOrCreate()

r = requests.get('https://www.skysports.com/football-scores-fixtures')

# Distribute the decoded page text as a one-element RDD, then turn it
# into a single-column DataFrame holding the raw HTML.
rdd = spark.sparkContext.parallelize([r.text])
df = spark.createDataFrame(rdd, "string").toDF("html_content")

df.show(truncate=False)

You can't directly ingest this website into PySpark as structured data. You need to parse it using a library like BeautifulSoup, as in the sketch below.
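A minimal sketch of that parsing step, assuming bs4 is installed; the h4 tag is a hypothetical selector, so inspect the live page to find the elements that hold the data you actually want:

import requests
from bs4 import BeautifulSoup
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

r = requests.get('https://www.skysports.com/football-scores-fixtures')
soup = BeautifulSoup(r.text, 'html.parser')

# Pull the text of every <h4> element (a placeholder selector) into
# one-field tuples so each match becomes a DataFrame row.
rows = [(tag.get_text(strip=True),) for tag in soup.find_all('h4')]

df = spark.createDataFrame(rows, ['text'])
df.show(truncate=False)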

But if you can't get the data from an API, you can ingest the raw HTML directly with PySpark like this:
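A minimal sketch of that direct ingestion, assuming the goal is a one-row DataFrame holding the whole page; the column name html is an assumption:

import requests
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

r = requests.get('https://www.skysports.com/football-scores-fixtures')

# Wrap the decoded response in a one-element list of tuples so the
# whole page lands in a single row and column.
df = spark.createDataFrame([(r.text,)], ['html'])
df.show(truncate=False)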
