
python - Pull DataLinks from Google Searches using Beautiful Soup - Stack Overflow


Evening Folks,

I'm attempting to ask Google a question, and pull all the relevant links from its respective search query (i.e. I search "site:wikipedia.com Thomas Jefferson" and it gives me wiki.com/jeff, wiki.com/tom, etc.)

Here's my code:

from bs4 import BeautifulSoup
from urllib2 import urlopen

query = 'Thomas Jefferson'

query = query.replace(" ", "+")
#replaces whitespace with a plus sign for Google compatibility purposes

soup = BeautifulSoup(urlopen("/?gws_rd=ssl#q=site:wikipedia+" + query), "html.parser")
#creates soup and opens URL for Google. Begins search with site:wikipedia so only wikipedia
#links show up. Uses html parser.

for item in soup.find_all('h3', attrs={'class' : 'r'}):
    print item.string
#Guides BS to h3 class "r" where green Wikipedia URLs are located, then prints URLs
#Limiter code to only pull top 5 results

The goal here is for me to set the query variable, have Python query Google, and have Beautiful Soup pull all the "green" links, if you will.

Here is a picture of a Google results page

I only wish to pull the green links, in their full extent. What's weird is that Google's Source Code is "hidden" (a symptom of their search architecture), so Beautiful Soup can't just go and pull a href from an h3 tag. I am able to see the h3 hrefs when I Inspect Element, but not when I view source.

Here is a picture of the Inspect Element

My question is: How do I go about pulling the top 5 most relevant green links from Google via BeautifulSoup if I cannot access their Source Code, only Inspect Element?

PS: To give an idea of what I am trying to accomplish, I have found two relatively close Stack Overflow questions like mine:

beautiful soup extract a href from google search

How to collect data of Google Search with beautiful soup using python

asked Feb 23, 2016 at 22:55 by user5112307 (edited May 23, 2017 at 12:18)

3 Answers


I got a different URL than Rob M. when I tried searching with JavaScript disabled:

https://www.google.com/search?q=site:wikipedia.com+Thomas+Jefferson&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw

To make this work with any query, you should first make sure that your query has no spaces in it (otherwise you'll get a 400: Bad Request). You can do this using urllib.quote_plus():

query = "Thomas Jefferson"
query = urllib.quote_plus(query)

which will urlencode all of the spaces as plus signs - creating a valid URL.
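
As a quick sanity check, here is what quote_plus produces (Python 2 syntax, to match the urllib.quote_plus call above):

import urllib

print urllib.quote_plus("Thomas Jefferson")  # Thomas+Jefferson
print urllib.quote_plus("Lewis & Clark")     # Lewis+%26+Clark -- reserved characters are escaped too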

However, this does not work with urllib - you get a 403: Forbidden. I got it to work by using the python-requests module like this:

import requests
import urllib
from bs4 import BeautifulSoup

query = 'Thomas Jefferson'
query = urllib.quote_plus(query)

r = requests.get('https://www.google.com/search?q=site:wikipedia.com+{}&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw'.format(query))
soup = BeautifulSoup(r.text, "html.parser")
#creates soup and opens URL for Google. Begins search with site:wikipedia.com so only wikipedia
#links show up. Uses html parser.

links = []
for item in soup.find_all('h3', attrs={'class' : 'r'}):
    links.append(item.a['href'][7:]) # [7:] strips the /url?q= prefix
#Guides BS to h3 class "r" where green Wikipedia URLs are located, then prints URLs
#Limiter code to only pull top 5 results

Printing links gives:

print links
#  [u'http://en.wikipedia.com/wiki/Thomas_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggUMAA&usg=AFQjCNG6INz_xj_-p7mpoirb4UqyfGxdWA',
#   u'http://www.wikipedia.com/wiki/Jefferson%25E2%2580%2593Hemings_controversy&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggeMAE&usg=AFQjCNEjCPY-HCdfHoIa60s2DwBU1ffSPg',
#   u'http://en.wikipedia.com/wiki/Sally_Hemings&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggjMAI&usg=AFQjCNGxy4i7AFsup0yPzw9xQq-wD9mtCw',
#   u'http://en.wikipedia.com/wiki/Monticello&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggoMAM&usg=AFQjCNE4YlDpcIUqJRGghuSC43TkG-917g',
#   u'http://en.wikipedia.com/wiki/Thomas_Jefferson_University&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggtMAQ&usg=AFQjCNEDuLjZwImk1G1OnNEnRhtJMvr44g',
#   u'http://www.wikipedia.com/wiki/Jane_Randolph_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggyMAU&usg=AFQjCNHmXJMI0k4Bf6j3b7QdJffKk97tAw',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1800&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg3MAY&usg=AFQjCNEqsc9jDsDetf0reFep9L9CnlorBA',
#   u'http://en.wikipedia.com/wiki/Isaac_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg8MAc&usg=AFQjCNHKAAgylhRjxbxEva5IvDA_UnVrTQ',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1796&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghBMAg&usg=AFQjCNHviErFQEKbDlcnDZrqmxGuiBG9XA',
#   u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1804&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghGMAk&usg=AFQjCNEJZSxCuXE_Dzm_kw3U7hYkH7OtlQ']
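
Note that the tracking parameters (&sa=, &ved=, &usg=) are still glued to each link, because the [7:] slice only removes the /url?q= prefix. A more robust variant (a sketch, using Python 2's urlparse module to stay consistent with the code above) parses the redirect, keeps only the q parameter, and slices to the top 5 as the question asks:

import urlparse

def clean_google_link(href):
    # Google wraps each result as /url?q=<target>&sa=...&ved=...&usg=...;
    # parse the query string and return just the q parameter
    params = urlparse.parse_qs(urlparse.urlparse(href).query)
    return params.get('q', [href])[0]

links = [clean_google_link(item.a['href'])
         for item in soup.find_all('h3', attrs={'class': 'r'})][:5]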

Actually, there's no need to disable JavaScript. The issue is more likely that you need to specify a user-agent so the request looks like a "real" user visit.

When no user-agent is specified, the requests library defaults to python-requests, so Google and other search engines can tell that the request comes from a bot/script and may block it. The HTML you receive then contains some sort of error page with different elements, which is why you were getting empty results.

Check what your user-agent is, or see a list of user-agents.
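
You can check what requests sends by default like this (requests.utils.default_headers() is part of the library's public API):

import requests

print(requests.utils.default_headers()['User-Agent'])
# e.g. python-requests/2.28.1 -- an obvious giveaway that the client is a script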


Code and full example in the online IDE:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

response = requests.get('https://www.google.com/search?q=site:wikipedia.com thomas edison', headers=headers).text
soup = BeautifulSoup(response, 'lxml')

for links in soup.find_all('div', class_='yuRUbf'):
    link = links.a['href']
    print(link)

# or using select() method which accepts CSS selectors

for links in soup.select('.yuRUbf a'):
    link = links['href']
    print(link)

Output:

https://en.wikipedia.com/wiki/Edison,_New_Jersey
https://en.wikipedia.com/wiki/Motion_Picture_Patents_Company
https://www.wikipedia.com/wiki/Thomas_E._Murray
https://en.wikipedia.com/wiki/Incandescent_light_bulb
https://en.wikipedia.com/wiki/Phonograph_cylinder
https://en.wikipedia.com/wiki/Emile_Berliner
https://wikipedia.com/wiki/Consolidated_Edison
https://www.wikipedia.com/wiki/hello
https://www.wikipedia.com/wiki/Tom%20Alston
https://en.wikipedia.com/wiki/Edison_screw

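Since the question only asks for the top five links, a plain list slice on the matched elements is enough (same select() call as above):

for links in soup.select('.yuRUbf a')[:5]:
    print(links['href'])
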
Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.

The difference is that you don't have to figure out which HTML elements to grab in order to extract the data, how to bypass blocks from Google or other search engines, or how to maintain the scraper over time (if something in the HTML changes).

Example code to integrate:

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "site:wikipedia. thomas edison",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results["organic_results"]:
    print(f"Link: {result['link']}")

Output:

Link: https://en.wikipedia.com/wiki/Edison,_New_Jersey
Link: https://en.wikipedia.com/wiki/Motion_Picture_Patents_Company
Link: https://www.wikipedia.com/wiki/Thomas_E._Murray
Link: https://en.wikipedia.com/wiki/Incandescent_light_bulb
Link: https://en.wikipedia.com/wiki/Phonograph_cylinder
Link: https://en.wikipedia.com/wiki/Emile_Berliner
Link: https://wikipedia.com/wiki/Consolidated_Edison
Link: https://www.wikipedia.com/wiki/hello
Link: https://www.wikipedia.com/wiki/Tom%20Alston
Link: https://en.wikipedia.com/wiki/Edison_screw
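
Here too, limiting the output to the top five results is just a slice of the returned list; no extra API parameters are needed:

for result in results["organic_results"][:5]:
    print(f"Link: {result['link']}")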

Disclaimer, I work for SerpApi.


P.S. I also maintain a dedicated web scraping blog.

This isn't going to work with the hash search (#q=site:wikipedia.com like you have it), as that loads the data in via AJAX rather than serving you the full parseable HTML with the results. You should use this instead:

soup = BeautifulSoup(urlopen("https://www.google./search?gbv=1&q=site:wikipedia.+" + query), "html.parser")

For reference, I disabled JavaScript and performed a Google search to get this URL structure.
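
If plain urlopen still returns a 403: Forbidden (as the accepted answer observed), the same gbv=1 URL can be fetched by attaching a browser-like User-Agent header via urllib2.Request; a minimal sketch, assuming the Python 2 setup from the question:

import urllib2
from bs4 import BeautifulSoup

url = "https://www.google.com/search?gbv=1&q=site:wikipedia.com+" + query
req = urllib2.Request(url, headers={"User-Agent": "Mozilla/5.0"})  # pose as a browser rather than a script
soup = BeautifulSoup(urllib2.urlopen(req), "html.parser")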
