Evening Folks,
I'm attempting to ask Google a question and pull all the relevant links from the respective search query (i.e. I search "site:wikipedia.com Thomas Jefferson" and it gives me wiki.com/jeff, wiki.com/tom, etc.).
Here's my code:
from bs4 import BeautifulSoup
from urllib2 import urlopen
query = 'Thomas Jefferson'
query = query.replace(" ", "+")
#replaces whitespace with a plus sign for Google compatibility purposes
#(str.replace returns a new string, so the result must be reassigned)
soup = BeautifulSoup(urlopen("https://www.google.com/?gws_rd=ssl#q=site:wikipedia.com+" + query), "html.parser")
#creates soup and opens URL for Google. Begins search with site:wikipedia.com so only Wikipedia
#links show up. Uses html parser.
for item in soup.find_all('h3', attrs={'class' : 'r'}):
print item.string
#Guides BS to h3 class "r" where green Wikipedia URLs are located, then prints URLs
#Limiter code to only pull top 5 results
The goal here is for me to set the query variable, have Python query Google, and have Beautiful Soup pull all the "green" links, if you will.
Here is a picture of a Google results page
I only wish to pull the green links, in their entirety. What's weird is that Google's source code is "hidden" (a symptom of their search architecture), so Beautiful Soup can't just go and pull an href from an h3 tag. I am able to see the h3 hrefs when I Inspect Element, but not when I view source.
Here is a picture of the Inspect Element
My question is: How do I go about pulling the top 5 most relevant green links from Google via BeautifulSoup if I cannot access their Source Code, only Inspect Element?
PS: To give an idea of what I am trying to accomplish, I have found two relatively close Stack Overflow questions like mine:
beautiful soup extract a href from google search
How to collect data of Google Search with beautiful soup using python
3 Answers
I got a different URL than Rob M. when I tried searching with JavaScript disabled:
https://www.google.com/search?q=site:wikipedia.com+Thomas+Jefferson&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw
To make this work with any query, you should first make sure that your query has no spaces in it (that's why you'll get a 400: Bad Request). You can do this using urllib.quote_plus():
query = "Thomas Jefferson"
query = urllib.quote_plus(query)
which will urlencode all of the spaces as plus signs - creating a valid URL.
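As a quick sanity check (a minimal sketch; any string with spaces works here), you can see what quote_plus() produces:
import urllib
print urllib.quote_plus("Thomas Jefferson")  # prints: Thomas+Jefferson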
However, this does not work with urllib (you get a 403: Forbidden). I got it to work by using the python-requests module like this:
import requests
import urllib
from bs4 import BeautifulSoup
query = 'Thomas Jefferson'
query = urllib.quote_plus(query)
r = requests.get('https://www.google.com/search?q=site:wikipedia.com+{}&gbv=1&sei=YwHNVpHLOYiWmQHk3K24Cw'.format(query))
soup = BeautifulSoup(r.text, "html.parser")
#creates soup and opens URL for Google. Begins search with site:wikipedia.com so only Wikipedia
#links show up. Uses html parser.
links = []
for item in soup.find_all('h3', attrs={'class' : 'r'}):
links.append(item.a['href'][7:]) # [7:] strips the /url?q= prefix
#Guides BS to h3 class "r" where green Wikipedia URLs are located, then prints URLs
#Limiter code to only pull top 5 results
Printing links gives:
print links
# [u'http://en.wikipedia.com/wiki/Thomas_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggUMAA&usg=AFQjCNG6INz_xj_-p7mpoirb4UqyfGxdWA',
# u'http://www.wikipedia.com/wiki/Jefferson%25E2%2580%2593Hemings_controversy&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggeMAE&usg=AFQjCNEjCPY-HCdfHoIa60s2DwBU1ffSPg',
# u'http://en.wikipedia.com/wiki/Sally_Hemings&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggjMAI&usg=AFQjCNGxy4i7AFsup0yPzw9xQq-wD9mtCw',
# u'http://en.wikipedia.com/wiki/Monticello&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggoMAM&usg=AFQjCNE4YlDpcIUqJRGghuSC43TkG-917g',
# u'http://en.wikipedia.com/wiki/Thomas_Jefferson_University&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggtMAQ&usg=AFQjCNEDuLjZwImk1G1OnNEnRhtJMvr44g',
# u'http://www.wikipedia.com/wiki/Jane_Randolph_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFggyMAU&usg=AFQjCNHmXJMI0k4Bf6j3b7QdJffKk97tAw',
# u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1800&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg3MAY&usg=AFQjCNEqsc9jDsDetf0reFep9L9CnlorBA',
# u'http://en.wikipedia.com/wiki/Isaac_Jefferson&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFgg8MAc&usg=AFQjCNHKAAgylhRjxbxEva5IvDA_UnVrTQ',
# u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1796&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghBMAg&usg=AFQjCNHviErFQEKbDlcnDZrqmxGuiBG9XA',
# u'http://en.wikipedia.com/wiki/United_States_presidential_election,_1804&sa=U&ved=0ahUKEwj4p5-4zI_LAhXCJCYKHUEMCjQQFghGMAk&usg=AFQjCNEJZSxCuXE_Dzm_kw3U7hYkH7OtlQ']
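Note that the [7:] slice leaves Google's tracking parameters (&sa=, &ved=, &usg=) glued to the end of each result, as you can see above. If you want the bare Wikipedia URLs, one possible refinement (a sketch, assuming each href keeps the /url?q=...&sa=... redirect format shown here) is to parse the redirect's query string instead of slicing:
import urlparse  # Python 2; use urllib.parse on Python 3

links = []
for item in soup.find_all('h3', attrs={'class': 'r'}):
    href = item.a['href']  # e.g. /url?q=http://en.wikipedia.com/wiki/Thomas_Jefferson&sa=U&...
    # parse_qs splits the query string, so 'q' holds just the target URL
    links.append(urlparse.parse_qs(urlparse.urlparse(href).query)['q'][0])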
Actually, there's no need to disable JavaScript. The problem is more likely that you need to specify a user-agent header so the request looks like a "real" user visit.
When no user-agent is specified while using the requests library, it defaults to python-requests, so Google and other search engines can tell the request comes from a bot/script and may block it; the returned HTML will then contain some sort of error page with different elements, which is why you were getting empty results.
Check what your user-agent is, or see a list of user-agents.
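If you're curious what requests sends by default, one quick way to check (a sketch using the public httpbin.org echo service) is:
import requests

# httpbin echoes back the User-Agent header it received
print(requests.get('https://httpbin.org/user-agent').json())
# e.g. {'user-agent': 'python-requests/2.x.x'} (version varies)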
Code and full example in the online IDE:
from bs4 import BeautifulSoup
import requests, lxml
headers = {
"User-Agent":
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
response = requests.get('https://www.google.com/search?q=site:wikipedia.com thomas edison', headers=headers).text
soup = BeautifulSoup(response, 'lxml')
for links in soup.find_all('div', class_='yuRUbf'):
link = links.a['href']
print(link)
# or using select() method which accepts CSS selectors
for links in soup.select('.yuRUbf a'):
link = links['href']
print(link)
Output:
https://en.wikipedia.com/wiki/Edison,_New_Jersey
https://en.wikipedia.com/wiki/Motion_Picture_Patents_Company
https://www.wikipedia.com/wiki/Thomas_E._Murray
https://en.wikipedia.com/wiki/Incandescent_light_bulb
https://en.wikipedia.com/wiki/Phonograph_cylinder
https://en.wikipedia.com/wiki/Emile_Berliner
https://wikipedia.com/wiki/Consolidated_Edison
https://www.wikipedia.com/wiki/hello
https://www.wikipedia.com/wiki/Tom%20Alston
https://en.wikipedia.com/wiki/Edison_screw
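Since the original question asks for only the top 5 results, a minimal tweak (assuming the same .yuRUbf markup) is to slice the selection:
for links in soup.select('.yuRUbf a')[:5]:
    print(links['href'])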
Alternatively, you can use the Google Search Engine Results API from SerpApi. It's a paid API with a free plan.
The difference is that you don't have to figure out which HTML elements to grab in order to extract the data, how to bypass blocks from Google or other search engines, or how to maintain the parser over time (if something in the HTML changes).
Example code to integrate:
import os
from serpapi import GoogleSearch
params = {
"engine": "google",
"q": "site:wikipedia. thomas edison",
"api_key": os.getenv("API_KEY"),
}
search = GoogleSearch(params)
results = search.get_dict()
for result in results["organic_results"]:
print(f"Link: {result['link']}")
Output:
Link: https://en.wikipedia.com/wiki/Edison,_New_Jersey
Link: https://en.wikipedia.com/wiki/Motion_Picture_Patents_Company
Link: https://www.wikipedia.com/wiki/Thomas_E._Murray
Link: https://en.wikipedia.com/wiki/Incandescent_light_bulb
Link: https://en.wikipedia.com/wiki/Phonograph_cylinder
Link: https://en.wikipedia.com/wiki/Emile_Berliner
Link: https://wikipedia.com/wiki/Consolidated_Edison
Link: https://www.wikipedia.com/wiki/hello
Link: https://www.wikipedia.com/wiki/Tom%20Alston
Link: https://en.wikipedia.com/wiki/Edison_screw
Disclaimer: I work for SerpApi.
P.S. I also write a dedicated web scraping blog.
This isn't going to work with the hash search (#q=site:wikipedia.com, like you have it), as that loads the data in via AJAX rather than serving you the full parseable HTML with the results. You should use this instead:
soup = BeautifulSoup(urlopen("https://www.google.com/search?gbv=1&q=site:wikipedia.com+" + query), "html.parser")
For reference, I disabled JavaScript and performed a Google search to get this URL structure.
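Putting it together, here is a sketch of how the original script might look with this URL structure (untested against live Google; as another answer notes, Google may still block requests that lack a browser-like user-agent):
from bs4 import BeautifulSoup
from urllib2 import urlopen

query = 'Thomas Jefferson'
query = query.replace(" ", "+")  # Google expects + instead of spaces

# gbv=1 asks for the basic (non-JavaScript) version of the results page
url = "https://www.google.com/search?gbv=1&q=site:wikipedia.com+" + query
soup = BeautifulSoup(urlopen(url), "html.parser")

# limit to the top 5 results, as the question asked
for item in soup.find_all('h3', attrs={'class': 'r'})[:5]:
    print item.a['href']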