typescript - How to Resolve Google News Redirects to Get the Final Article URL Using Axios?

I'm trying to scrape news articles from Google News using Node.js. The issue I am facing is with the links provided by the RSS feed: they are Google RSS links that redirect to the original article.

I attempted to use Axios to follow the redirects and extract the final URL (apnews link) using response.request.res.responseUrl, but this approach doesn't work for Google News links. The responseUrl always remains the same as the original Google News URL.

I also tried Puppeteer, but it is too slow and heavyweight for this. Besides the original link, I want to grab the image and the description from the article's Open Graph tags. Is there a faster way to do this than Puppeteer, using Axios or some other library?

import axios from "axios";

async function getRedirectUrl(googleUrl: string): Promise<string> {
    try {
        const response = await axios.get(googleUrl, {
            maxRedirects: 5,
            validateStatus: function (status) {
                return status >= 200 && status < 303;
            },
        });

        // Still prints the Google News URL, not the article URL
        console.log(response.request.res.responseUrl);
        return response.request.res.responseUrl || googleUrl;
    } catch (error) {
        console.log("Error following redirect:", error);
        return googleUrl;
    }
}

Example link provided by the RSS feed: https://news.google.com/rss/articles/CBMiwAFBVV95cUxNc1hWZ0hlVFNubnVpeWcyMWcwOExOOW0wSlNrRWdTdWtPZlhkZ0dROTdnRnlkNFZ5VnpUSHJyYzlpWkpFeVlORnlWRnFGRmVHLTlYWTN3YmVPUjlrcTRVWGo3Qk9rd1pTX2hkM05xSEtOc1NLNXZFSXVIYjdORjdOT21QUDZyV2VwaHltaVRiQXI3ZkJYSW1PX2RLYWZhWmZ0RFY4cGh5NGFmX3RNRk5sNzlLWW14c3gyNTFGdmkzRkk?oc=5

Which redirects to: https://apnews.com/article/munich-zelenskyy-russia-ukraine-stubb-finland-putin-trump-vance-a96cd82f8011ce75570d45fe45e41625

asked Feb 16 at 22:05 by Deus

2 Answers

I have already answered this in another post, using Python with Requests and BeautifulSoup.

Here's a JavaScript equivalent using Axios and Cheerio:

const axios = require('axios');
const cheerio = require('cheerio');

// Resolve a Google News RSS article link to the publisher's article URL.
async function getArticleUrl(googleRssUrl) {
    // The article page carries the request parameters in the data-p
    // attribute of a c-wiz element.
    const response = await axios.get(googleRssUrl);
    const $ = cheerio.load(response.data);
    const data = $('c-wiz[data-p]').attr('data-p');
    if (!data) {
        throw new Error('c-wiz[data-p] not found; the page layout may have changed');
    }
    // data-p starts with "%.@." instead of "[", so patch it into valid JSON.
    const obj = JSON.parse(data.replace('%.@.', '["garturlreq",'));

    // Form-encode the body explicitly so it matches the Content-Type header.
    const payload = new URLSearchParams({
        'f.req': JSON.stringify([[['Fbv4je', JSON.stringify([...obj.slice(0, -6), ...obj.slice(-2)]), 'null', 'generic']]]),
    }).toString();

    const headers = {
        'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36',
    };

    const postResponse = await axios.post('https://news.google.com/_/DotsSplashUi/data/batchexecute', payload, { headers });
    // Strip the ")]}'" prefix, then unwrap the nested JSON string that holds the URL.
    const arrayString = JSON.parse(postResponse.data.replace(")]}'", ""))[0][2];
    const articleUrl = JSON.parse(arrayString)[1];

    return articleUrl;
}

const rss = 'https://news.google.com/rss/articles/CBMiwAFBVV95cUxNc1hWZ0hlVFNubnVpeWcyMWcwOExOOW0wSlNrRWdTdWtPZlhkZ0dROTdnRnlkNFZ5VnpUSHJyYzlpWkpFeVlORnlWRnFGRmVHLTlYWTN3YmVPUjlrcTRVWGo3Qk9rd1pTX2hkM05xSEtOc1NLNXZFSXVIYjdORjdOT21QUDZyV2VwaHltaVRiQXI3ZkJYSW1PX2RLYWZhWmZ0RFY4cGh5NGFmX3RNRk5sNzlLWW14c3gyNTFGdmkzRkk?oc=5';
getArticleUrl(rss)
    .then(url => console.log(url))
    .catch(err => console.error('Failed to resolve article URL:', err));
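
Once the article URL is known, the Open Graph image and description mentioned in the question can be read from the article's HTML with one more plain GET request, without a browser. Below is a minimal, dependency-free sketch of the parsing step only. `extractOpenGraph` is a hypothetical helper name and `sampleHtml` is made-up markup. The regular expressions are a simplification: they do not decode HTML entities and will break on unusual markup, so with Cheerio already loaded, `$('meta[property="og:image"]').attr('content')` is likely the more robust choice.

```javascript
// Hypothetical helper: collect og:* meta tags from an HTML string.
// Returns an object such as { image: "...", description: "..." }.
function extractOpenGraph(html) {
    const result = {};
    // Find every <meta ...> tag, then read its property and content attributes.
    const metaTags = html.match(/<meta\b[^>]*>/gi) || [];
    for (const tag of metaTags) {
        const property = /\bproperty\s*=\s*["']og:([^"']+)["']/i.exec(tag);
        const content = /\bcontent\s*=\s*(?:"([^"]*)"|'([^']*)')/i.exec(tag);
        if (property && content) {
            const value = content[1] !== undefined ? content[1] : content[2];
            // Keep the first occurrence of each property.
            if (!(property[1] in result)) {
                result[property[1]] = value;
            }
        }
    }
    return result;
}

// Made-up markup standing in for a downloaded article page.
const sampleHtml = `
<html><head>
  <meta property="og:title" content="Example headline">
  <meta property="og:image" content="https://example.com/cover.jpg" />
  <meta content="A short summary." property="og:description">
</head><body></body></html>`;

const og = extractOpenGraph(sampleHtml);
console.log(og.image);        // https://example.com/cover.jpg
console.log(og.description);  // A short summary.
```

In practice the HTML would come from something like `(await axios.get(articleUrl)).data`; some publishers may still require a browser-like User-Agent header.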

The news portal link is loaded by a JS script that makes a chain of requests before loading the actual link.

You can't get the actual link with a single request. You can try to emulate the chain of requests, but you need to get all the parameters right to get the URL.
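
As a rough illustration of what "getting the parameters right" involves, here is a sketch of the two pure steps around the `batchexecute` call used in the Axios and Cheerio answer above: building the request body from the page's `data-p` attribute, and unwrapping the response. The helper names `buildBatchBody` and `parseBatchResponse` are hypothetical, and the sample strings are made-up stand-ins, not real Google data. The array layout is taken from that answer's code rather than from any documented API, so it may change without notice.

```javascript
// Hypothetical helper: turn the c-wiz data-p attribute into a form-encoded body.
function buildBatchBody(dataP) {
    // data-p starts with "%.@." instead of "[", so patch it into valid JSON.
    const obj = JSON.parse(dataP.replace('%.@.', '["garturlreq",'));
    // Keep everything except the four elements that precede the last two.
    const inner = [...obj.slice(0, -6), ...obj.slice(-2)];
    const fReq = JSON.stringify([[['Fbv4je', JSON.stringify(inner), 'null', 'generic']]]);
    return new URLSearchParams({ 'f.req': fReq }).toString();
}

// Hypothetical helper: pull the article URL out of the raw response text.
function parseBatchResponse(text) {
    // The response begins with a ")]}'" guard prefix that is not valid JSON.
    const outer = JSON.parse(text.replace(")]}'", ''));
    // Element [0][2] is itself a JSON string; its second entry is the URL.
    return JSON.parse(outer[0][2])[1];
}

// Made-up data-p value with placeholder elements "a" through "h".
const demoBody = buildBatchBody('%.@."a","b","c","d","e","f","g","h"]');
console.log(new URLSearchParams(demoBody).get('f.req'));

// Made-up response in the shape the answer's code expects.
const demoResponse = ")]}'\n\n" + JSON.stringify([
    ['wrb.fr', 'Fbv4je', JSON.stringify(['garturlres', 'https://example.com/article'])],
]);
console.log(parseBatchResponse(demoResponse)); // https://example.com/article
```

Keeping these steps separate from the network calls makes it easier to spot which part broke if the page format changes.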

Otherwise, a Puppeteer- or Selenium-like solution is your best option.
