I'm trying to scrape news articles from Google News using Node.js. The issue I am facing is with the links provided by the RSS feed: each one is a Google News RSS link that redirects to the original article.
Example: Link provided by the RSS feed - https://news.google/rss/articles/CBMiwAFBVV95cUxNc1hWZ0hlVFNubnVpeWcyMWcwOExOOW0wSlNrRWdTdWtPZlhkZ0dROTdnRnlkNFZ5VnpUSHJyYzlpWkpFeVlORnlWRnFGRmVHLTlYWTN3YmVPUjlrcTRVWGo3Qk9rd1pTX2hkM05xSEtOc1NLNXZFSXVIYjdORjdOT21QUDZyV2VwaHltaVRiQXI3ZkJYSW1PX2RLYWZhWmZ0RFY4cGh5NGFmX3RNRk5sNzlLWW14c3gyNTFGdmkzRkk?oc=5
Which redirects to - https://apnews/article/munich-zelenskyy-russia-ukraine-stubb-finland-putin-trump-vance-a96cd82f8011ce75570d45fe45e41625
I attempted to use Axios to follow the redirects and extract the final URL (apnews link) using response.request.res.responseUrl, but this approach doesn't work for Google News links. The responseUrl always remains the same as the original Google News URL.
I used Puppeteer to do the same thing, but it is too slow and heavyweight for this. Beyond the original link, I also want to grab the image and the description from the article's Open Graph tags. Is there a faster way than Puppeteer, using Axios or some other library?
import axios from "axios";

// Tries to resolve a Google News RSS link by letting Axios follow HTTP redirects.
async function getRedirectUrl(googleUrl: string): Promise<string> {
  try {
    const response = await axios.get(googleUrl, {
      maxRedirects: 5,
      validateStatus: function (status) {
        return status >= 200 && status < 303;
      }
    });
    // For Google News links this still logs the original Google News URL.
    console.log(response.request.res.responseUrl)
    return response.request.res.responseUrl || googleUrl
  } catch (error) {
    console.log("Error following redirect:", error)
    return googleUrl
  }
}
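On the Open Graph part: og:image and og:description are ordinary <meta> tags in the article's HTML, so once the final URL is known they can be read from a single GET response without a browser. A minimal dependency-free sketch follows. Note that extractOpenGraph is a hypothetical helper name, the regex scan assumes quoted attribute values and does not decode HTML entities, and a real HTML parser such as Cheerio would be more robust.

```javascript
// Hypothetical helper: pulls og:image and og:description out of an HTML string.
// Assumption: attribute values are quoted and contain no ">" character.
function extractOpenGraph(html) {
  const result = {};
  // Visit every <meta ...> tag and collect its attributes in any order.
  for (const [tag] of html.matchAll(/<meta\b[^>]*>/gi)) {
    const attrs = {};
    for (const [, name, , value] of tag.matchAll(/([\w:-]+)\s*=\s*(["'])(.*?)\2/g)) {
      attrs[name.toLowerCase()] = value;
    }
    // Keep the first occurrence of each field.
    if (attrs.property === 'og:image') result.image ??= attrs.content;
    if (attrs.property === 'og:description') result.description ??= attrs.content;
  }
  return result;
}

// Synthetic sample page, not taken from a real article.
const sampleHtml = `
<head>
  <meta property="og:image" content="https://example.com/cover.jpg">
  <meta content="A short summary." property="og:description" />
</head>`;

console.log(extractOpenGraph(sampleHtml));
```

In practice the HTML string would be the body of a GET request to the resolved article URL.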
asked Feb 16 at 22:05 by Deus
2 Answers
I have already answered this here using Python with Requests & BeautifulSoup.
Here's a JavaScript equivalent using Axios & Cheerio:
const axios = require('axios');
const cheerio = require('cheerio');

async function getArticleUrl(googleRssUrl) {
  // Fetch the page Google News serves for the RSS link (no HTTP redirect happens here).
  const response = await axios.get(googleRssUrl);
  const $ = cheerio.load(response.data);

  // The request parameters are embedded in the data-p attribute of a c-wiz element.
  const data = $('c-wiz[data-p]').attr('data-p');
  if (!data) {
    throw new Error('No c-wiz[data-p] element found in the response');
  }
  // Swap the "%.@." prefix for an array opener so the value parses as JSON.
  const obj = JSON.parse(data.replace('%.@.', '["garturlreq",'));

  const payload = {
    'f.req': JSON.stringify([[['Fbv4je', JSON.stringify([...obj.slice(0, -6), ...obj.slice(-2)]), 'null', 'generic']]])
  };
  const headers = {
    'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36',
  };

  const postResponse = await axios.post('https://news.google/_/DotsSplashUi/data/batchexecute', payload, { headers });
  // Strip the ")]}'" prefix, then unwrap the JSON string nested at [0][2].
  const arrayString = JSON.parse(postResponse.data.replace(")]}'", ""))[0][2];
  const articleUrl = JSON.parse(arrayString)[1];
  return articleUrl;
}

const rss = 'https://news.google/rss/articles/CBMiwAFBVV95cUxNc1hWZ0hlVFNubnVpeWcyMWcwOExOOW0wSlNrRWdTdWtPZlhkZ0dROTdnRnlkNFZ5VnpUSHJyYzlpWkpFeVlORnlWRnFGRmVHLTlYWTN3YmVPUjlrcTRVWGo3Qk9rd1pTX2hkM05xSEtOc1NLNXZFSXVIYjdORjdOT21QUDZyV2VwaHltaVRiQXI3ZkJYSW1PX2RLYWZhWmZ0RFY4cGh5NGFmX3RNRk5sNzlLWW14c3gyNTFGdmkzRkk?oc=5'
getArticleUrl(rss).then(url => console.log(url)).catch(err => console.error(err));
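The two least obvious steps in the function above are building the f.req payload and unwrapping the response. They can be pulled out into pure functions and tried without any network call. This is only a sketch: buildBatchExecuteBody and parseBatchExecuteResponse are names made up for illustration, the sample inputs are synthetic stand-ins rather than real Google output, and as far as I know the endpoint is not a documented API, so the format may change.

```javascript
// Hypothetical helper: turns the c-wiz data-p attribute into a form-encoded body.
function buildBatchExecuteBody(dataP) {
  // data-p only parses as JSON once its "%.@." prefix becomes an array opener.
  const params = JSON.parse(dataP.replace('%.@.', '["garturlreq",'));
  // Keep everything except the four entries at positions -6 through -3.
  const trimmed = [...params.slice(0, -6), ...params.slice(-2)];
  const fReq = JSON.stringify([[['Fbv4je', JSON.stringify(trimmed), 'null', 'generic']]]);
  // URLSearchParams percent-encodes the value for application/x-www-form-urlencoded.
  return new URLSearchParams({ 'f.req': fReq }).toString();
}

// Hypothetical helper: extracts the article URL from the raw response text.
function parseBatchExecuteResponse(body) {
  // Drop the ")]}'" prefix, parse the outer array, then parse the JSON string at [0][2].
  const outer = JSON.parse(body.replace(")]}'", ''));
  const inner = JSON.parse(outer[0][2]);
  return inner[1];
}

// Synthetic inputs that only mimic the shapes the answer's code relies on.
const sampleDataP = '%.@."a","b",1,2,3,4,"y","z"]';
const sampleBody =
  ")]}'\n\n" +
  JSON.stringify([['x', 'Fbv4je', JSON.stringify(['y', 'https://example.com/article'])]]);

console.log(buildBatchExecuteBody(sampleDataP));
console.log(parseBatchExecuteResponse(sampleBody)); // prints https://example.com/article
```

Passing the encoded string as the POST body (with the same Content-Type header) should be equivalent to passing the payload object, since the form encoding is then done up front.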
The news portal link is loaded by a JS script that goes through a chain of requests before the actual link is resolved.
A plain HTTP request alone can't get the actual link. You can try to emulate the chain of requests, but you need to get all the parameters right to get the URL.
A Puppeteer- or Selenium-like solution is your best option.