How do I scrape data in <canvas> element with python or javascript? - Stack Overflow

I want to scrape data from sites like this (a stats page for the game I play), where an interactive chart is rendered in a <canvas> element and the data does not appear in any scrapeable HTML element. Inspecting the HTML, the page appears to use Chart.js.

Though help in python is preferred, if I really need to use some javascript, that would be fine too.

Plus, I would like to avoid methods that require extra files such as phantomjs but again, if that's the only way, please be generous enough to share it.

Asked Jan 5, 2020 at 10:33 by Anonumous Pomp; edited Jan 24, 2023 at 18:12 by OneCricketeer.
  • For python you can use selenium – Iain Shelvington Commented Jan 5, 2020 at 10:35
  • Can you share the URL of the page? – Aleksandar Ciric Commented Jan 5, 2020 at 10:38
  • @IainShelvington I have no clue how to use selenium to scrape data from a canvas. I am a noob in web scraping. – Anonumous Pomp Commented Jan 5, 2020 at 10:38
  • You cannot scrape a canvas directly because it behaves like an image, so you would need image-recognition software. However, you can find all the data shown in the canvas elsewhere on your page, for example in this tag: //div[@class='playerStatPage']/following-sibling::script, or in image elements such as //div[@id='ribbons-sm']/div[@class='ribbon-wrapper'] – Aleksandar Ciric Commented Jan 5, 2020 at 10:52
  • And you don't need JavaScript. – Aleksandar Ciric Commented Jan 5, 2020 at 10:53

1 Answer

One way to solve this is to look at the <script> in the page source, around line 1050, which is where the charts are actually initialized. The initialization follows a recurring pattern: each canvas element is queried to get its context, followed by the variables that hold the labels and statistics of that chart.
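To make that pattern concrete, here is a minimal sketch of the shape such an inline script might take. The canvas IDs, variable names, and numbers are invented for illustration and are not copied from the page, and the order (values first, then labels) is only inferred from how the answer's code zips the arrays. The helper `groupArraysByCanvas` is hypothetical and uses plain regular expressions rather than an AST, purely to show how each canvas ID is followed by its arrays.

```javascript
// Hypothetical inline script, shaped like the pattern described above.
// The canvas IDs, variable names, and numbers are invented for illustration.
const sampleScript = `
var ctx = document.getElementById('killsChart').getContext('2d');
var killsData = [12, 7, 3];
var killsLabels = ["Rifle", "Sniper", "Pistol"];
var ctx = document.getElementById('winsChart').getContext('2d');
var winsData = [5, 9];
var winsLabels = ["Team", "Solo"];
`;

// Walk the script line by line: a getElementById(...) call starts a new
// group, and every array literal after it is attached to that group.
function groupArraysByCanvas(source) {
    const groups = {};
    let currentId = null;
    for (const line of source.split('\n')) {
        const idMatch = line.match(/getElementById\(\s*['"]([^'"]+)['"]\s*\)/);
        if (idMatch) {
            currentId = idMatch[1];
            groups[currentId] = [];
            continue;
        }
        const arrayMatch = line.match(/=\s*(\[.*\])\s*;?\s*$/);
        if (currentId && arrayMatch) {
            // JSON.parse only handles JSON-compatible literals
            // (double-quoted strings, numbers); real scripts may need an AST.
            groups[currentId].push(JSON.parse(arrayMatch[1]));
        }
    }
    return groups;
}

const grouped = groupArraysByCanvas(sampleScript);
console.log(grouped);
```

A line-based regex like this breaks as soon as a declaration spans several lines or uses single-quoted strings, which is one reason to prefer parsing the script into an AST.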

This solution uses Node.js (a recent version) with the following modules:

  • cheerio for querying elements in the DOM.
  • axios for sending an HTTP request to get the page source.
  • abstract-syntax-tree to get a JavaScript object tree representation of the script that we wish to scrape.

Here's the source code for the solution:

const cheerio = require('cheerio');
const axios = require('axios');
const { parse, each, find } = require('abstract-syntax-tree');

async function main() {

    // get the page source
    const { data } = await axios.get(
        'https://stats.warbrokers.io/players/i/5d2ead35d142affb05757778'
    );

    // load the page source with cheerio to query the elements
    const $ = cheerio.load(data);

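    // get the content of the script tag that contains the string 'Chart.defaults'
    const contents = $('script')
        .toArray()
        .map(script => $(script).html())
        .find(text => text && text.includes('Chart.defaults'));

    // stop early if no such script exists, e.g. because the page layout changed
    if (!contents) {
        throw new Error('No script containing "Chart.defaults" was found');
    }

    // convert the script content to an AST
    const ast = parse(contents);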

    // we'll put all declarations in this object
    const declarations = {};

    // current key
    let key = null;

    // iterate over all variable declarations inside a script
    each(ast, 'VariableDeclaration', node => {

        // iterate over possible declarations, e.g. comma separated
        node.declarations.forEach(item => {

            // let's get the key to contain the values of the statistics and their labels
            // we'll use the ID of the canvas itself in this case..
            if(item.id.name === 'ctx') { // is this a canvas context variable?
                // get the only string literal that is not '2d'
                const literal = find(item, 'Literal').find(v => v.value !== '2d');
                if(literal) { // do we have non- '2d' string literals?
                    // then assign it as the current key
                    key = literal.value;
                }
            }

            // ensure that the variable we're getting is an array expression
            if(key && item.init && item.init.type === 'ArrayExpression') {

                // get the array expression
                const array = item.init.elements.map(v => v.value);

                // did we get the values from the statistics?
                if(declarations[key]) {

                    // zip the objects to associate keys and values properly
                    const result = {};
                    for(let index = 0; index < array.length; index++) {
                        result[array[index]] = declarations[key][index];
                    }
                    declarations[key] = result;

                    // let's make the key null again to avoid getting
                    // unnecessary array expression
                    key = null;

                } else {
                    // store the values
                    declarations[key] = array;
                }
            }

        });

    });

    // logging it here, it's up to you how you deal with the data itself
    console.log(declarations);

}

main().catch(console.error);
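
The zipping step inside the loop can also be looked at in isolation. This is a minimal sketch, assuming the first array found for a canvas holds the values and the second holds the labels, which is the order the code above relies on. `zipLabels` and the sample data are hypothetical and only serve as illustration.

```javascript
// Pair each label with the value at the same index.
// zipLabels is a hypothetical helper name, not part of any library.
function zipLabels(labels, values) {
    return Object.fromEntries(
        labels.map((label, index) => [label, values[index]])
    );
}

// Invented sample data for illustration.
const values = [12, 7, 3];
const labels = ['Rifle', 'Sniper', 'Pistol'];

const zipped = zipLabels(labels, values);
console.log(zipped);
```

If the labels array is longer than the values array, the extra labels map to undefined, so it may be worth checking that both lengths match before zipping.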