
typescript - How to Resolve Google News Redirects to Get the Final Article URL Using Axios? - Stack Overflow


I'm trying to scrape news articles from Google News using Node.js. The problem is with the links provided by the RSS feed: each one is a Google News RSS link that redirects to the original article.

Example link provided by the RSS feed: https://news.google.com/rss/articles/CBMiwAFBVV95cUxNc1hWZ0hlVFNubnVpeWcyMWcwOExOOW0wSlNrRWdTdWtPZlhkZ0dROTdnRnlkNFZ5VnpUSHJyYzlpWkpFeVlORnlWRnFGRmVHLTlYWTN3YmVPUjlrcTRVWGo3Qk9rd1pTX2hkM05xSEtOc1NLNXZFSXVIYjdORjdOT21QUDZyV2VwaHltaVRiQXI3ZkJYSW1PX2RLYWZhWmZ0RFY4cGh5NGFmX3RNRk5sNzlLWW14c3gyNTFGdmkzRkk?oc=5

It redirects to: https://apnews.com/article/munich-zelenskyy-russia-ukraine-stubb-finland-putin-trump-vance-a96cd82f8011ce75570d45fe45e41625

I tried using Axios to follow the redirects and read the final URL (the apnews link) from response.request.res.responseUrl, but this approach doesn't work for Google News links. The responseUrl always stays the same as the original Google News URL.

Puppeteer can do this, but it is too slow and heavyweight for the task. Ultimately I want the original link plus the image and description from its Open Graph tags. Is there a faster way than Puppeteer, using Axios or some other library?

import axios from "axios";

async function getRedirectUrl(googleUrl: string): Promise<string> {
    try {
        const response = await axios.get(googleUrl, {
            maxRedirects: 5,
            validateStatus: (status) => status >= 200 && status < 303,
        });

        // responseUrl is the URL of the last response once HTTP redirects have been followed
        console.log(response.request.res.responseUrl);
        return response.request.res.responseUrl || googleUrl;
    } catch (error) {
        console.log("Error following redirect:", error);
        return googleUrl;
    }
}
Asked Feb 16 at 22:05 by Deus

2 Answers

I have already answered this elsewhere using Python with Requests and BeautifulSoup.

Here's a JavaScript equivalent using Axios and Cheerio:

const axios = require('axios');
const cheerio = require('cheerio');

async function getArticleUrl(googleRssUrl) {
    // Fetch the Google News page that the RSS link points to.
    const response = await axios.get(googleRssUrl);
    const $ = cheerio.load(response.data);

    // The request parameters are stored in the data-p attribute of a c-wiz element.
    const data = $('c-wiz[data-p]').attr('data-p');
    if (!data) {
        throw new Error('No c-wiz[data-p] element found; the page layout may have changed.');
    }
    const obj = JSON.parse(data.replace('%.@.', '["garturlreq",'));

    // Build the form body: all but the last six parameters, plus the final two.
    const payload = new URLSearchParams({
        'f.req': JSON.stringify([[['Fbv4je', JSON.stringify([...obj.slice(0, -6), ...obj.slice(-2)]), 'null', 'generic']]])
    });

    const headers = {
        'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36',
    };

    const postResponse = await axios.post('https://news.google.com/_/DotsSplashUi/data/batchexecute', payload, { headers });

    // Strip the ")]}'" prefix, then unwrap the nested JSON string that holds the article URL.
    const arrayString = JSON.parse(postResponse.data.replace(")]}'", ""))[0][2];
    const articleUrl = JSON.parse(arrayString)[1];

    return articleUrl;
}

const rss = 'https://news.google.com/rss/articles/CBMiwAFBVV95cUxNc1hWZ0hlVFNubnVpeWcyMWcwOExOOW0wSlNrRWdTdWtPZlhkZ0dROTdnRnlkNFZ5VnpUSHJyYzlpWkpFeVlORnlWRnFGRmVHLTlYWTN3YmVPUjlrcTRVWGo3Qk9rd1pTX2hkM05xSEtOc1NLNXZFSXVIYjdORjdOT21QUDZyV2VwaHltaVRiQXI3ZkJYSW1PX2RLYWZhWmZ0RFY4cGh5NGFmX3RNRk5sNzlLWW14c3gyNTFGdmkzRkk?oc=5';
getArticleUrl(rss)
    .then(url => console.log(url))
    .catch(err => console.error('Failed to resolve article URL:', err.message));
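
The two string transformations in this function can be pulled out and checked without touching the network. The TypeScript sketch below does that. The helper names buildBatchPayload and parseBatchResponse are hypothetical, and the sample data-p value and response body are synthetic stand-ins shaped only to match what the code above expects; they are not captured Google News output.

```typescript
// Hypothetical helpers mirroring the string handling in getArticleUrl.
// All sample inputs below are synthetic, not real Google News data.

/** Turn the raw data-p attribute into the value sent as the f.req form field. */
function buildBatchPayload(dataP: string): string {
  // Swapping the "%.@." prefix for '["garturlreq",' yields a parseable JSON array.
  const params: unknown[] = JSON.parse(dataP.replace("%.@.", '["garturlreq",'));
  // Keep everything except the last six entries, then re-append the final two.
  const trimmed = [...params.slice(0, -6), ...params.slice(-2)];
  return JSON.stringify([[["Fbv4je", JSON.stringify(trimmed), "null", "generic"]]]);
}

/** Pull the article URL out of a batchexecute-style response body. */
function parseBatchResponse(body: string): string {
  // Strip the ")]}'" prefix so the remainder parses as JSON.
  const outer = JSON.parse(body.replace(")]}'", ""));
  // The third field of the first entry is itself a JSON-encoded array.
  const inner = JSON.parse(outer[0][2]);
  return inner[1];
}

// Synthetic data-p value with nine entries after the prefix.
const sampleDataP = '%.@."a","b","c","d","e","f","g","h","i"]';
const fReq = buildBatchPayload(sampleDataP);
console.log(fReq);

// Synthetic response body in the shape the parsing code expects.
const sampleBody =
  ")]}'\n\n" +
  JSON.stringify([
    ["field0", "field1", JSON.stringify(["field2", "https://example.com/article", 1])],
  ]);
console.log(parseBatchResponse(sampleBody));
```

Keeping these steps as pure functions makes it easier to notice when the page format changes: a failing JSON.parse points at the data-p handling, while an unexpected value from parseBatchResponse points at the response shape.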

The news article link is loaded by a JavaScript script through a chain of requests before the actual link is reached.

A single HTTP request is not enough to get the actual link. You can try to emulate the chain of requests, but you need to get every parameter right to obtain the URL.

A Puppeteer- or Selenium-like solution is your best option.
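
Whichever way the final article URL is obtained, the Open Graph image and description the question asks for live in plain meta tags, so reading them does not need a browser. The following is a minimal, dependency-free TypeScript sketch; the function name extractOpenGraph and the sample HTML are hypothetical. It uses regular expressions for brevity, so it assumes quoted attribute values with no ">" inside them and does not decode HTML entities; a real HTML parser such as Cheerio would be more robust.

```typescript
// Hypothetical helper: collect og:* meta tags from an HTML string.
// Simplified on purpose; see the stated assumptions in the text above.
function extractOpenGraph(html: string): Record<string, string> {
  const result: Record<string, string> = {};
  // Find every <meta ...> tag in the document.
  const metaTags = html.match(/<meta\b[^>]*>/gi) ?? [];

  for (const tag of metaTags) {
    // Parse name="value" or name='value' pairs inside the tag.
    const attrs: Record<string, string> = {};
    const attrPattern = /([\w:-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')/g;
    let match: RegExpExecArray | null;
    while ((match = attrPattern.exec(tag)) !== null) {
      attrs[(match[1] ?? "").toLowerCase()] = match[2] ?? match[3] ?? "";
    }

    // Open Graph uses property="og:..."; some sites use name="og:..." instead.
    const key = attrs["property"] ?? attrs["name"];
    const content = attrs["content"];
    if (key && key.startsWith("og:") && content !== undefined && !(key in result)) {
      result[key] = content; // the first occurrence of each key wins
    }
  }
  return result;
}

// Synthetic page head used only to exercise the function.
const sampleHtml = `
<html><head>
  <meta charset="utf-8">
  <meta property="og:image" content="https://example.com/cover.jpg">
  <meta content="A short summary." property='og:description' />
</head><body></body></html>`;

const og = extractOpenGraph(sampleHtml);
console.log(og["og:image"], og["og:description"]);
```

In practice the html argument would be the body of a normal GET request to the resolved article URL, for example response.data from axios.get(articleUrl).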
