javascript - How to ignore an invalid SSL certificate with requests_html? - Stack Overflow

So basically, I'm trying to scrape the JavaScript-generated data from a website. To do this, I'm using the Python library requests_html.

Here is my code:

from requests_html import HTMLSession
session = HTMLSession()

url = 'https://myurl'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
payload = {'mylog': 'root', 'mypass': 'root'}

r = session.post(url, headers=headers, verify=False, data=payload)
r.html.render()
load = r.html.find('#load_span', first=True)

print (load.text)  

If I don't call the render() function, I can connect to the website, but my scraped data is null (which is expected). When I do call it, I get this error:

pyppeteer.errors.PageError: net::ERR_CERT_COMMON_NAME_INVALID at https://myurl

or

net::ERR_CERT_WEAK_SIGNATURE_ALGORITHM

I assume the verify=False parameter of session.post is ignored by render(). How can I make the rendering step ignore the invalid certificate?

Edit: if you want to reproduce the error:

from requests_html import HTMLSession
import requests

session = HTMLSession()

url = 'https://wrong.host.badssl.'
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}

r = session.post(url, headers=headers, verify=False)

r.html.render()

load = r.html.find('#content', first=True)

print (load)

asked Aug 9, 2018 at 8:51 by LayaCazoca, edited Jun 16, 2020 at 10:26 by Jeremy Thompson
  • Which version of Python and the requests library are you using? – Alessandro Romano Commented Aug 9, 2018 at 9:02
  • @Alessandro I'm using Python 3.6 and requests_html 0.9.0 – LayaCazoca Commented Aug 9, 2018 at 9:08
  • Which OS are you using? It's quite difficult to reproduce this error the way you describe. – Alessandro Romano Commented Aug 9, 2018 at 9:19
  • @Alessandro I'm on macOS. I think it's possible to reproduce this with any site that has an unverified certificate. – LayaCazoca Commented Aug 9, 2018 at 9:23
  • @Alessandro I edited the post to provide a reproduction – LayaCazoca Commented Aug 9, 2018 at 9:33

1 Answer


The only way is to set the ignoreHTTPSErrors parameter in pyppeteer. The problem is that requests_html doesn't provide any way to set this parameter; in fact, there is an open issue about it. My advice is to ping the developers again by adding another message to that issue.
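As a stopgap, you can drive pyppeteer directly and pass ignoreHTTPSErrors when launching the browser. This is a minimal sketch, assuming pyppeteer is installed (requests_html pulls it in) and using the question's placeholder URL:

import asyncio
from pyppeteer import launch

async def fetch_rendered(url):
    # ignoreHTTPSErrors makes Chromium accept the invalid certificate
    browser = await launch(ignoreHTTPSErrors=True)
    page = await browser.newPage()
    await page.goto(url)
    content = await page.content()  # HTML after JavaScript has run
    await browser.close()
    return content

html = asyncio.get_event_loop().run_until_complete(fetch_rendered('https://myurl'))
print(html)

You lose the requests_html conveniences (sessions, .find()), so you would parse the returned HTML yourself, but it sidesteps the certificate check entirely.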

Or maybe you can submit a pull request for this new feature yourself.

Another way is to use Selenium.
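For example, a rough Selenium equivalent might look like this (assuming Chrome and a matching chromedriver are installed; the URL and selector are the ones from the question). Note that this does a plain GET, so a login would have to be performed through the page rather than POSTed directly:

from selenium import webdriver
from selenium.webdriver.common.by import By

options = webdriver.ChromeOptions()
options.add_argument('--ignore-certificate-errors')  # tolerate the invalid certificate
driver = webdriver.Chrome(options=options)

driver.get('https://myurl')
load = driver.find_element(By.CSS_SELECTOR, '#load_span')
print(load.text)
driver.quit()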

EDIT:
I added verify=False as a feature with a pull request (accepted). It is now possible to ignore the SSL error :)

It's not a parameter of get(); you set it when you instantiate the session object:

session = HTMLSession(verify=False)
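Put together with the code from the question, the flow would look roughly like this (a sketch that assumes a requests_html release including the merged change, so that verify=False on the session is also honoured when render() launches Chromium):

from requests_html import HTMLSession

session = HTMLSession(verify=False)  # skip certificate verification for requests and render()

r = session.post('https://myurl', data={'mylog': 'root', 'mypass': 'root'})
r.html.render()
load = r.html.find('#load_span', first=True)
print(load.text)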