
javascript - DOMParser for large html - Stack Overflow


I have a large amount of html clipboard data from Excel, about 250MB (though it contains a lot of formatting, so when actually pasting it in, the data is much, much smaller than that).

Currently I am using the following DOMParser, which is just one line of code and everything happens behind the scenes:

const doc3 = parser.parseFromString(htmlString, "text/html");

However, it takes ~18 s to parse this, and during that time the page is entirely blocked until it finishes. Even if the work is offloaded to a web worker, the user just sees an action that gives no progress and simply 'waits' for 18 s until something eventually happens, which I would argue is almost the same as freezing, even though the user can technically still interact with the page.

Is there an alternative way to parse a large html/xml file? Perhaps something that doesn't load everything at once and so can stay responsive, or what might be a good solution for this? I suppose the following might be in line with it, but I'm not really sure: https://github.com/isaacs/sax-js.
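Something along these lines is roughly what I have in mind (an untested sketch assuming sax-js can run in loose mode over this markup and be fed in chunks; streamRows/onRow are placeholder names):

// Stream the clipboard HTML through sax-js instead of building a full DOM.
// Assumes sax-js (https://github.com/isaacs/sax-js) is available, e.g. via a bundler.
const sax = require('sax');

const streamRows = (htmlString, onRow, chunkSize = 1 << 20) => {
    const parser = sax.parser(false, { lowercase: true }); // loose mode for HTML-ish input
    let row = null;
    parser.onopentag = node => { if (node.name === 'tr') row = []; };
    parser.ontext = text => { if (row && text.trim()) row.push(text.trim()); };
    parser.onclosetag = name => { if (name === 'tr' && row) { onRow(row); row = null; } };
    parser.onerror = () => parser.resume(); // tolerate minor HTML-isms
    // feed the parser in chunks; a worker could post progress between writes
    for (let i = 0; i < htmlString.length; i += chunkSize)
        parser.write(htmlString.slice(i, i + chunkSize));
    parser.close();
};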


Update: here is a sample Excel file: https://drive.google.com/file/d/1GIK7q_aU5tLuDNBVtlsDput8Oo1Ocz01/view?usp=sharing. You can download the file, open it in Excel, press Cmd-A (select all) and Cmd-C (copy), and that will put the data in your clipboard. For me the copy takes up 249 MB for the text/html format in the clipboard.

Yes, it is also available as text/plain (which we use as a backup), but the point of grabbing it from the text/html is to capture the formatting (both data formatting, e.g. numberType=Percent with 3 decimals, and stylistic formatting, e.g. background color=red). Please use that as a test for any sample code. Here is the actual text/html content (in ASCII) from the clipboard: https://drive.google.com/file/d/1ZUL2A4Rlk3KPqO4vSSEEGBWuGXj7j5Vh/view?usp=sharing
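For reference, a minimal sketch of one way the two clipboard payloads can be grabbed (via a paste handler; handleClipboard is just a placeholder for the actual handling):

// 'text/html' carries the formatting, 'text/plain' is the small TSV fallback.
document.addEventListener('paste', e => {
    e.preventDefault();
    const html = e.clipboardData.getData('text/html');
    const plain = e.clipboardData.getData('text/plain');
    handleClipboard(html || null, plain); // placeholder
});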

asked Mar 15, 2021 at 20:18 by carl.hiass, edited Mar 19, 2021 at 0:43
  • Yes, a streaming XML parser can probably help. See my comment here. However, you state you want to parse HTML, but xlsx is made of XML files, and HTML is a lot harder to parse than XML. So what are you really trying to do? (Also, Workers don't have access to the DOMParser API anyway) – Kaiido Commented Mar 15, 2021 at 23:17
  • @Kaiido it's the html that is generated from copy-paste in Excel. Here is an example: gyazo.com/e3b061f3de6eeff0117867c8d7ac9102 – carl.hiass Commented Mar 16, 2021 at 3:51
  • Is it from the application "Numbers"? If so, this data is also accessible as TSV in the clipboard ("text/plain"), probably a lot easier to parse, and a lot smaller in memory too. If it's Excel or another app, I can't tell how they populate the clipboard, but it might be worth checking for an alternative too. – Kaiido Commented Mar 16, 2021 at 4:14
  • @Kaiido it's from Excel, but yes Google Sheets or any other app should probably have a similar "output as text/html" format. Yes parsing text/plain is much simpler and is our fallback, but back to the question at hand...any way to parse it faster, or at least make it responsive :) ? – carl.hiass Commented Mar 16, 2021 at 5:13
  • Having the resulting html markup would probably be more useful; not all software populates the clipboard the same way on all platforms. Moreover, your screenshot shows that your setup creates a <style> tag with rules that have to be matched against the elements below => not only do you need an HTML parser and not just a simple XML one, but you also need a CSS parser and a CSSOM implementation. If I were in your position, I'd double-check with the client whether they'd be OK with either omitting the styles when pasting big data, or being forced to send the XML file directly. – Kaiido Commented Mar 18, 2021 at 22:48
 |  Show 8 more comments

2 Answers


The problem here is not the HTML file size but the large number of DOM nodes it contains. For 900,000 rows and 8 columns in your HTML file we get these figures:

900,000 (TR elements) * (8 (TD elements) + 8 (text nodes)) = ~14 million DOM nodes!

I didn't manage to load it with DOMParser; the browser tab crashes after a while (FF, Chrome, 16 GB RAM), though it would be interesting to look at the browser's behavior on a successful load. Anyway, I had a similar challenge, handling millions of records in the browser, and the solution I came up with was to build table rows for only one screen at a time.

Considering the structure of your text/html file, the approach could be as follows:

  1. use FileReader to load html file as raw text
  2. grab rows, save them as text array, remove them from output
  3. parse resulting output, insert the table and style into DOM
  4. use a view / paging, render the current batch of rows on paging/scroll or search
  5. attach events for mouse/keyboard control

Below is a simple implementation which provides basic controls like sizing the view, paginating/scrolling, and filtering rows with regular expressions. Note that filtering is done on the row HTML; for text-only search you can uncomment the line "//text: text.match...", though in that case the file parsing time will increase a bit.

let tbody, style;
let rows = [], view = [], viewSize = 20, page = 0, time = 0;

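// Called when the FileReader finishes: grab the rows as raw text, build the table shell and style, then render the first page.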
const load = fRead => {
    console.timeEnd('FILE LOAD');
    console.time('GRAB ROWS');
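    // The greedy regex grabs everything from the first <tr> to the last </tr> as one raw-text
    // chunk (accumulated in trows) and strips it from the markup, so only the small table
    // shell is parsed into DOM nodes below.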
    let thead, trows = '', table = fRead.result
        .replace(/<tr[^]+<\/tr>/i, text => (trows += text) && '');
    console.timeEnd('GRAB ROWS');
    console.time('PARSE/INSERT TABLE & STYLE');
    const html = document.createElement('div');
    html.innerHTML = table;
    table = html.querySelector('table');
    if (!table || !trows) {
        setInfo('NO DATA FOUND');
        return;
    }
    if (style = html.querySelector('style'))
        document.head.appendChild(style);
    table.textContent = '';
    el('viewport').appendChild(table);
    console.timeEnd('PARSE/INSERT TABLE & STYLE');
    console.time('PREPARE ROWS ARRAY');
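    // Split the raw row text on '<tr' to get one record per row: 'html' is the full row
    // markup used for rendering, 'text' is what the filter regex is matched against.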
    rows = trows.split('<tr').slice(1).map(text => ({
        html: '<tr' + text, text,
        //text: text.match(/>.*<\/td>/gi).map(s => s.slice(1, -5)).join(' '),
    }));
    console.timeEnd('PREPARE ROWS ARRAY');
    console.time('RENDER TABLE');
    table.appendChild(thead = document.createElement('thead'));
    table.appendChild(tbody = document.createElement('tbody'));
    thead.innerHTML = rows[0].html;
    view = rows = rows.slice(1);
    renew();
    console.timeEnd('RENDER TABLE');
    console.timeEnd('INIT');
};

const reset = info => {
    el('info').textContent = info ?? '';
    el('viewport').textContent = '';
    style?.remove();
    style = null;
    tbody = null;
    view = rows = [];
};

const pages = () => Math.ceil(view.length / viewSize) - 1;

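// Render the current page slice of the (possibly filtered) view into the table body and refresh the info line.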
const renew = () => {
    if (!tbody)
        return;
    console.time('RENDER VIEW');
    const i = page * viewSize;
    tbody.innerHTML = view.slice(i, i + viewSize)
        .map(row => row.html).join('');
    console.timeEnd('RENDER VIEW');
    setInfo(`
        rows total: ${rows.length},
        rows match: ${view.length},
        pages: ${pages()}, page: ${page}
    `);
};

const gotoPage = num => {
    el('page').value = page = Math.max(0, Math.min(pages(), num));
    renew();
};

const fileInput = () => {
    reset('LOADING...');
    const fRead = new FileReader();
    fRead.onload = load.bind(null, fRead);
    console.time('INIT');
    console.time('FILE LOAD');
    fRead.readAsText(el('file').files[0]);
};

const fileReset = () => {
    reset();
    el('file').files = new DataTransfer().files;
};

const setInfo = text => el('info').innerHTML = text;

const setView = e => {
    let value = +e.target.value;
    value = Number.isNaN(value * 0) ? 20 : value;
    e.target.value = viewSize = Math.max(1, Math.min(value, 100));
    renew();
};

const setPage = e => {
    const page = +e.target.value;
    gotoPage(Number.isNaN(page * 0) ? 0 : page);
};

const setFilter = e => {
    const filter = e.target.value;
    let match;
    try {
        match = new RegExp(filter);
    } catch (e) {
        setInfo(e);
        return;
    }
    view = rows.filter(row => match.test(row.text));
    page = 0;
    renew();
};

const keys = {'PageUp': -1, 'PageDown': 1};

const scroll = e => {
    const dir = e.key ? keys[e.key] ?? 0 : Math.sign(-e.deltaY);
    if (!dir)
        return;
    e.preventDefault();
    gotoPage(page += dir);
};

const el = id => document.getElementById(id);

el('file').addEventListener('input', fileInput);
el('reset').addEventListener('click', fileReset);
el('view').addEventListener('input', setView);
el('page').addEventListener('input', setPage);
el('filter').addEventListener('input', setFilter);
el('viewport').addEventListener('keydown', scroll);
el('viewport').addEventListener('wheel', scroll);
div {
    display: flex;
    flex: 1;
    align-items: center;
    white-space: nowrap;
}
thead td,
tbody tr td:first-child {
    background: grey;
    color: white;
}
td { padding: 0 .5em; }
#menu > * { margin: 0 .25em; }
#file { min-width: 16em; }
#view, #page { width: 8em; }
#filter { flex: 1; }
#info { padding: .5em; color: red; }
<div id="menu">
    <span>FILE:</span>
        <input id="file" type="file" accept="text/html">
        <button id="reset">RESET</button>
    <span>VIEW:</span><input id="view" type="number" value="20">
    <span>PAGE:</span><input id="page" type="number" value="0">
    <span>FILTER:</span><input id="filter">
</div>
<div id="info"></div>
<div id="viewport" tabindex="0"></div>

As a result, for a 262 MB HTML file (900,000 table rows) we get the following timings in Chromium:

FILE LOAD: 352.57421875 ms
GRAB ROWS: 700.1943359375 ms
PARSE/INSERT TABLE & STYLE: 0.78125 ms
PREPARE ROWS ARRAY: 755.763916015625 ms
RENDER VIEW: 0.926025390625 ms
RENDER TABLE: 4.317138671875 ms
INIT: 1814.19287109375 ms
RENDER VIEW: 5.275146484375 ms
RENDER VIEW: 4.6318359375 ms

So the time until the first batch of rows is rendered (time to screen) is ~1.8 s, i.e. an order of magnitude lower than the time spent by DOMParser as reported by the OP, and rendering of subsequent rows is almost instant: ~5 ms.

I’d at least try using XMLHttpRequest as a parser. Unlike DOMParser, it’s asynchronous (so the webpage can be interacted with while loading is in progress), it can report progress, and it can read from Blob objects you get from Clipboard.read, so the overhead of passing around large strings is also minimised.

Last I checked, however, this technique did not always work in all browsers, so don’t throw away DOMParser just yet, if only to have it as a fallback.

Besides DOMParser and XMLHttpRequest, the only native Web API providing DOM parsing functionality is DOM Level 3 Load & Save, which, as far as I am aware, no mainstream browser has ever implemented. This means XMLHttpRequest is basically your only option.

Here’s a quick-and-dirty example using XMLHttpRequest as a parser:

const parseHTML = (html, progress) => {
    let cleanup = null;
    let url;

    if (typeof Blob !== 'undefined') {
        if (typeof html === 'string') {
            url = URL.createObjectURL(new Blob([html], { 'type': 'text/html' }));
        } else if (html instanceof Blob) {
            url = URL.createObjectURL(html);
        } else {
            throw new TypeError('html is neither a string nor a Blob');
        }
        cleanup = () => { URL.revokeObjectURL(url); }
    } else if (typeof html === 'string') {
        /* fallback to using data: URIs */
        url = 'data:text/html,' + encodeURIComponent(html);
    } else {
        throw new TypeError('html is neither a string nor a Blob');     
    }
    
    return new Promise((accept, reject) => {
        const xhr = new XMLHttpRequest();
        xhr.open('GET', url);
        xhr.overrideMimeType('text/html');
        xhr.responseType = 'document';
    
        xhr.onload = () => {
            accept(xhr.response || xhr.responseXML);
        };
        
        if (progress) {
            xhr.onprogress = (ev) => {
                /* percentage = ev.loaded / ev.total * 100;
                 * (beware of ev.total === 0)
                 */
                progress(ev);
            };
        }
        
        /* XXX: if the promise is awaited, this makes it
         * throw a ProgressEvent on failure, which is…
         * unusual, though workable */
        xhr.onabort = xhr.onerror = (ev) => {
            reject(ev);
        };
        
        xhr.onloadend = cleanup;
        
        xhr.send(null);
    });
};

When I tested this myself, performance was less than stellar, though somewhat bearable (after the file was loaded, parsing itself took about half a minute, during which the browser was rather unresponsive). I also noticed this would occasionally return null for the empty string, so beware of that as well.
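For example, assuming an async context and a Blob (or string) grabbed from the clipboard's text/html payload (htmlBlob is a placeholder name):

// doc may be null (e.g. for an empty string), so guard before querying it
const doc = await parseHTML(htmlBlob, ev => {
    if (ev.lengthComputable)
        console.log(`parsed ${(ev.loaded / ev.total * 100).toFixed(1)}%`);
});
const rowCount = doc ? doc.querySelectorAll('tr').length : 0;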
