I would like to extract text from HTML with pure Javascript (this is for a Chrome extension).
Specifically, I would like to be able to find text on a page and extract text after it.
Even more specifically, on a page like
https://picasaweb.google./kevin.smilak/BestOfAmericaSGrandCircle#4974033581081755666
I would like to find text "Latitude" and extract the value that goes after it. HTML there is not in a very structured form.
What is an elegant solution to do it?
Asked May 22, 2011 at 19:59 by dudarev
5 Answers
There is no elegant solution, in my opinion, because, as you said, the HTML is not structured, and the words "Latitude" and "Longitude" depend on page localization. The best I can think of is relying on the cardinal points, which might not change:
var data = document.getElementById("lhid_tray").innerHTML;
var lat = data.match(/((\d)*\.(\d)*)°(\s*)(N|S)/)[1];
var lon = data.match(/((\d)*\.(\d)*)°(\s*)(E|W)/)[1];
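Note that match returns null when nothing is found, so indexing [1] directly throws on a page without coordinates. A minimal null-safe sketch of the same idea follows. The helper name findCoordinates and the sample string are made up for illustration; in the extension the input would be the text or HTML of the tray element.

```javascript
// Hypothetical helper: finds "<number>° N|S" and "<number>° E|W" in a string.
// Returns null for a coordinate that is not present instead of throwing.
function findCoordinates(text) {
  var lat = text.match(/(\d+(?:\.\d+)?)°\s*([NS])/);
  var lon = text.match(/(\d+(?:\.\d+)?)°\s*([EW])/);
  return {
    lat: lat ? lat[1] + '° ' + lat[2] : null,
    lon: lon ? lon[1] + '° ' + lon[2] : null
  };
}

// Sample input (assumed shape, based on the values shown on the page)
var sample = 'Latitude: 36.872068° N Longitude: 111.387291° W';
var coords = findCoordinates(sample);
console.log(coords.lat, coords.lon);
```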
You could do:
var str = document.getElementsByClassName("gphoto-exifbox-exif-field")[4].innerHTML;
var latPos = str.indexOf('Latitude');
// Take the text between the first <em> and </em> that follow "Latitude"
var lat = str.substring(str.indexOf('<em>', latPos) + 4, str.indexOf('</em>', latPos));
The text you're interested in is found inside of a div with class gphoto-exifbox-exif-field. Since this is for a Chrome extension, we have document.querySelectorAll, which makes selecting that element easy:
var div = document.querySelectorAll('div.gphoto-exifbox-exif-field')[4],
text = div.innerText;
/* text looks like:
"Filename: img_3474.jpg
Camera: Canon
Model: Canon EOS DIGITAL REBEL
ISO: 800
Exposure: 1/60 sec
Aperture: 5.0
Focal Length: 18mm
Flash Used: No
Latitude: 36.872068° N
Longitude: 111.387291° W"
*/
It's easy to get what you want now:
var lng = text.split('Longitude:')[1].trim(); // "111.387291° W"
I used trim() instead of split('Longitude: ') since that's not actually a space character in the innerText (URL-encoded, it's %C2%A0 ...no time to figure out what that maps to, sorry).
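For reference, %C2%A0 is the UTF-8 percent-encoding of U+00A0, the no-break space. The sketch below checks that and shows one alternative to trim(): splitting on a regular expression, since \s in JavaScript also matches U+00A0. The sample string is an assumption modeled on the innerText shown above.

```javascript
// %C2%A0 decodes to a single character: U+00A0 (no-break space)
var nbsp = decodeURIComponent('%C2%A0');
console.log(nbsp === '\u00A0'); // true

// Sample text (assumed): a no-break space follows each colon
var text = 'Latitude:\u00A036.872068° N\nLongitude:\u00A0111.387291° W';

// \s matches U+00A0 as well as ordinary whitespace, so no trim() is needed
var lng = text.split(/Longitude:\s*/)[1];
console.log(lng); // "111.387291° W"
```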
I would query the DOM and just collect the image information into an object, so you can reference any property you want.
E.g.
function getImageData() {
    var props = {};
    // Each value sits in an <em>; its label is the text node just before it
    Array.prototype.forEach.call(
        document.querySelectorAll('.gphoto-exifbox-exif-field > em'),
        function (prop) {
            props[prop.previousSibling.nodeValue.replace(/[\s:]+/g, '')] = prop.textContent;
        }
    );
    return props;
}
var data = getImageData();
console.log(data.Latitude); // 36.872068° N
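If the em elements turn out not to be reliable, a similar object can be built from an element's plain text instead. This is a variation on the approach above rather than the same technique: it splits the text into lines and cuts each line at the first colon. The function name parseFields is hypothetical, and the sample text is modeled on the innerText listing in the earlier answer.

```javascript
// Hypothetical helper: turns "Key: value" lines into an object.
function parseFields(text) {
  var props = {};
  text.split('\n').forEach(function (line) {
    var colon = line.indexOf(':');
    if (colon === -1) return; // skip lines without a label
    var key = line.slice(0, colon).replace(/\s+/g, '');
    props[key] = line.slice(colon + 1).trim(); // trim() also removes no-break spaces
  });
  return props;
}

var sample = 'ISO: 800\nFocal Length: 18mm\nLatitude: 36.872068° N\nLongitude: 111.387291° W';
var fields = parseFields(sample);
console.log(fields.Latitude);    // "36.872068° N"
console.log(fields.FocalLength); // "18mm"
```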
If a more general answer is needed for other sites, you can try something like:
var text = document.body.innerHTML;
text = text.replace(/(<([^>]+)>)/ig,""); //strip out all HTML tags
var latArray = text.match(/Latitude:?\s*[^0-9]*[0-9]*\.?[0-9]*\s*°\s*[NS]/gim);
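// Returns an array of every match (or null if there is none) for:
// "Latitude", an optional ":", whitespace, any non-digit characters, a decimal number, whitespace, "°", whitespace, then N or S
// Flags: g (global), i (ignore case), m (multiline; it has no effect here because the pattern uses no ^ or $ anchors)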
For that example, an array with one element, "Latitude: 36.872068° N", is returned, which should be easy to parse.
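Parsing such a string into a number could look like the sketch below. It assumes the "label: number° hemisphere" shape shown above and the usual convention that south and west are negative; toSignedDegrees is a hypothetical name.

```javascript
// Hypothetical helper: "Latitude: 36.872068° N" -> 36.872068; S and W become negative
function toSignedDegrees(str) {
  var m = str.match(/(\d+(?:\.\d+)?)\s*°\s*([NSEW])/i);
  if (!m) return null; // not in the expected shape
  var value = parseFloat(m[1]);
  var hemisphere = m[2].toUpperCase();
  return (hemisphere === 'S' || hemisphere === 'W') ? -value : value;
}

console.log(toSignedDegrees('Latitude: 36.872068° N'));   // 36.872068
console.log(toSignedDegrees('Longitude: 111.387291° W')); // -111.387291
```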