javascript - Extract the whole text contained in webpage using Chrome extension

I'm developing a Chrome extension that parses the text of Google search results. I want the user to enter some text in the omnibox and then be directed to a Google search page.

function navigate(url) {
    chrome.tabs.query({active: true, currentWindow: true}, function(tabs) {
        chrome.tabs.update(tabs[0].id, {url: url});
    });
}

chrome.omnibox.onInputEntered.addListener(function(text) {
    navigate("https://www.google.com.br/search?hl=pt-BR&lr=lang_pt&q=" + text + "%20%2B%20cnpj");
});

alert('Here is where the text will be extracted');

After directing the current tab to the search page, I want to get the page as plain text so that I can parse it afterwards. What is the most straightforward way to accomplish this?

asked Sep 4, 2016 at 13:41 by Filipe Aleixo

Use a content script. You should have known this if you read the extensions overview (this IS important), which I've linked in my comment to your previous question. – wOxxOm Commented Sep 4, 2016 at 13:46
There are many examples you can find. Here's one: "How to check how many elements of a certain type has a page with Chrome Extension Dev". Just don't forget that only plain objects may be passed (not DOM elements), so use e.g. document.body.innerHTML for the code parameter, or [].map.call(document.querySelectorAll('div.rc a'), function(a) { return {url: a.href, text: a.textContent} }); – wOxxOm Commented Sep 4, 2016 at 14:08

1 Answer

Parsing the webpage will probably be easier if you work with the DOM rather than with plain text. However, that is not what your question asked.

Your code has issues with how you navigate to the page and with how it handles the asynchronous nature of web navigation. That is also not what your question asked, but it affects how the thing you did ask about, getting text from a webpage, has to be implemented.

As such, to answer your question of how to extract the plain text from a webpage, I implemented the extraction so that it runs when the user clicks a browser_action button. This keeps the answer to your question separate from the other issues in your code.
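
The navigation timing issue mentioned above is outside what this answer implements, but one way it might be handled is to wait for chrome.tabs.onUpdated to report status 'complete' for the tab before trying to read the page. The following is only a sketch under assumptions: navigateThenRun and its parameters are hypothetical names, and api is expected to be the extension's global chrome object (it is passed in only so the function can be exercised outside the browser). Injecting a script into the newly loaded page would probably also need a host permission for the search domain, which the manifest in this answer does not declare.

```javascript
// Navigate a tab, then call onLoaded once that tab reports it finished loading.
// api: assumed to be the global `chrome` object (hypothetical parameter).
function navigateThenRun(api, tabId, url, onLoaded) {
    function handleUpdate(updatedTabId, changeInfo) {
        // Ignore other tabs and intermediate states such as 'loading'.
        if (updatedTabId !== tabId || changeInfo.status !== 'complete') {
            return;
        }
        // Only react once per navigation.
        api.tabs.onUpdated.removeListener(handleUpdate);
        onLoaded(tabId);
    }
    // Register the listener before navigating so the event is not missed.
    api.tabs.onUpdated.addListener(handleUpdate);
    api.tabs.update(tabId, {url: url});
}
```

In the extension this would be called as navigateThenRun(chrome, tabs[0].id, url, callback), where callback is the place to inject the content script.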

As wOxxOm mentioned in a comment, you have to use a content script to get access to the DOM of a webpage. As he did, I suggest you read the Overview of Chrome extensions. You can inject a content script using chrome.tabs.executeScript. Normally, you would inject a script contained in a separate file using the file property of the details parameter.

When the task is as simple as sending back the text of the webpage (without parsing, etc.), it is reasonable to inject just the single line of code that the most basic approach requires. A short segment of code like that can be injected using the code property of the details parameter. In this case, given that you have said nothing about your requirements for the text, document.body.innerText is the text returned.

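To get the text back to the background script, the code below relies on chrome.tabs.executeScript itself: the value of the last statement evaluated in the injected code is passed, inside an array, to the callback, receiveText.

If the content script has to send data at some later point instead, it can call chrome.runtime.sendMessage(), and the background script can receive the data in a listener added to chrome.runtime.onMessage.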
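
For the message-passing route, a rough sketch could look like the following. It is an illustration only: the message shape {type: "pageText", text: ...} and the names SEND_TEXT_CODE and makeReceiveText are hypothetical, and the wiring to chrome.runtime.onMessage and chrome.tabs.executeScript is shown in comments so that the block stays self-contained.

```javascript
// Code a content script could run to send the page text to the background
// script. The message shape is a made-up convention for this sketch.
const SEND_TEXT_CODE =
    'chrome.runtime.sendMessage({type: "pageText", text: document.body.innerText});';

// Build an onMessage listener that forwards matching messages to handleText
// and ignores everything else.
function makeReceiveText(handleText) {
    return function receiveText(message, sender) {
        if (message && message.type === 'pageText') {
            handleText(message.text, sender);
        }
    };
}

// Assumed wiring in background.js:
// chrome.runtime.onMessage.addListener(makeReceiveText(function (text) {
//     console.log(text);
// }));
// chrome.tabs.executeScript(tab.id, {code: SEND_TEXT_CODE});
```

Checking a type field before handling the message keeps the listener from reacting to unrelated messages sent by other parts of the extension.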

background.js:

chrome.browserAction.onClicked.addListener(function(tab) {
    console.log('Injecting content script(s)');
    //On Firefox document.body.textContent is probably more appropriate
    chrome.tabs.executeScript(tab.id,{
        code: 'document.body.innerText;'
        //If you had something somewhat more complex you can use an IIFE:
        //code: '(function (){return document.body.innerText})();'
        //If your code was complex, you should store it in a
        // separate .js file, which you inject with the file: property.
    },receiveText);
});

//tabs.executeScript() returns the results of the executed script
//  in an array of results, one entry per frame in which the script
//  was injected.
function receiveText(resultsArray){
    console.log(resultsArray[0]);
}
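
receiveText above assumes the injection succeeded. When it fails (for example on a page the extension is not allowed to access), the callback is invoked without a results array and chrome.runtime.lastError is set. A more defensive version might look like this sketch; firstFrameText is a hypothetical helper name, and lastError is passed in as an argument so the function does not depend on the chrome global.

```javascript
// Return the text from the first frame's result, or null when the
// injection failed or did not produce a string.
function firstFrameText(resultsArray, lastError) {
    if (lastError || !Array.isArray(resultsArray) || resultsArray.length === 0) {
        return null;
    }
    const text = resultsArray[0];
    return typeof text === 'string' ? text : null;
}

// Assumed use inside the executeScript callback in background.js:
// function receiveText(resultsArray) {
//     const text = firstFrameText(resultsArray, chrome.runtime.lastError);
//     if (text !== null) {
//         console.log(text);
//     }
// }
```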

manifest.json:

{
    "description": "Gets the text of a webpage and logs it to the console",
    "manifest_version": 2,
    "name": "Get Webpage Text",
    "version": "0.1",

    "permissions": [
        "activeTab"
    ],

    "background": {
        "scripts": [
            "background.js"
        ]
    },

    "browser_action": {
        "default_icon": {
            "32": "myIcon.png"
        },
        "default_title": "Get Webpage Text",
        "browser_style": true
    }
}
