I am trying to scrape some data from a web page with Node.js, but I am having problems with character encoding.
The web page states that its encoding is:
<meta http-equiv="Content-Type" content="text/html; charset=windows-1250">
When I browse it with Chrome, it sets the encoding to windows-1250 and everything looks fine.
As there is no windows-1250 encoding/decoding for streams in Node (and utf8 did not work), I found the iconv-lite package, which should be able to convert easily between different encodings. But I still get wrong characters after I save the response to a file (or print it to the console). I also tried different encodings, the native Node buffer encodings, and setting the headers to the same values I see in Chrome ('Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'), but nothing seems to work correctly.
You can see the whole code here: https://gist.github./4110999.
I suppose I am missing something fundamental about how the encoding works, so any help on how to get the data with the correct characters would be appreciated.
EDIT:
I also tried the node-iconv package in case it is a package problem. I changed line 51 to:
var decoder = new Iconv_native('WINDOWS-1250', 'UTF-8');
var decoded = decoder.convert(body).toString();
but still getting the same results.
Asked Nov 19, 2012 at 14:55 by aocenas; edited Nov 19, 2012 at 17:27.
2 Answers
I'm not familiar with the iconv-lite package, but looking through its code, it looks like you'll need to use win1250 instead of windows1250 (see here).
The encodings are looked up as a hash.
Also, the readme uses this code, with 'win1251' instead of 'windows1251':
str = iconv.decode(buf, 'win1251');
I think you are converting a string, but you must convert the raw bytes! (If you are reading something from the web, you must read it as binary.)
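To illustrate the difference, here is a minimal sketch that uses Node's built-in TextDecoder rather than iconv. That swap is an assumption made to keep the example dependency-free: it needs a Node build with full ICU, which the official binaries have shipped since version 13. The three sample bytes are the windows-1250 encoding of the word "čaj", chosen only for illustration.

```javascript
// The windows-1250 bytes for "čaj" (0xE8 is "č" in that code page).
const raw = Buffer.from([0xe8, 0x61, 0x6a]);

// Wrong: treating the bytes as UTF-8 text. 0xE8 is not a valid UTF-8
// sequence here, so it is replaced with U+FFFD and the original byte is lost.
const wrong = raw.toString('utf8');

// Right: decode the untouched bytes with the charset the page declares.
const right = new TextDecoder('windows-1250').decode(raw);

console.log(JSON.stringify(wrong));
console.log(right);
```

Once the bytes have passed through the wrong decoder, no later conversion can bring the original characters back, which would explain why switching packages did not change the result.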
Example of reading a windows-1250 file from disk:
var fs = require('fs');
var Iconv = require('iconv').Iconv;
// Without options (no encoding specified), 'fs' returns the raw bytes as a Buffer.
var bytes = fs.readFileSync('myFile.txt');
// This is bad: var myBadString = fs.readFileSync('myFile.txt', { encoding: "UTF-8" });
// Convert the raw CP1250 (windows-1250) bytes to UTF-8, then turn them into a string.
var translated = new Iconv('CP1250', 'UTF-8').convert(bytes).toString();
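The same idea should carry over to a page fetched over HTTP. The sketch below is not the code from the question: it assumes Node 18 or later (for the global fetch) and a full-ICU build (for TextDecoder), and decodeBody and fetchDecoded are hypothetical helper names introduced only for this example.

```javascript
// Decode raw body bytes with the given charset
// (defaults to windows-1250, the charset the page declares).
function decodeBody(bytes, charset = 'windows-1250') {
  return new TextDecoder(charset).decode(bytes);
}

// Fetch a page as raw bytes and decode them exactly once.
async function fetchDecoded(url, charset = 'windows-1250') {
  const res = await fetch(url);
  // arrayBuffer() hands back the untouched bytes;
  // res.text() would decode them as UTF-8 instead.
  const bytes = new Uint8Array(await res.arrayBuffer());
  return decodeBody(bytes, charset);
}
```

Calling fetchDecoded with the page's URL returns a promise for the decoded HTML. With an older HTTP client, the equivalent step is to collect the response chunks as Buffers and concatenate them before decoding, rather than letting the client turn them into a string first.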