javascript - Getting correct string from windows-1250 encoded web page with node.js - Stack Overflow

I am trying to scrape some data from a web page with Node.js, but I am having problems with character encoding. The web page declares its encoding as: <meta http-equiv="Content-Type" content="text/html; charset=windows-1250"> When I browse it with Chrome, the browser sets the encoding to windows-1250 and everything looks fine.

As there is no windows-1250 encoding/decoding for streams in Node (and utf8 did not work), I found the iconv-lite package, which should be able to convert between different encodings easily. But I still get wrong characters after I save the response to a file (or print it to the console). I also tried different encodings, the native Node buffer encodings, and setting the same request headers I see in Chrome ('Accept-Charset': 'ISO-8859-1,utf-8;q=0.7,*;q=0.3'), but nothing seems to work correctly.

You can see the whole code here: https://gist.github.com/4110999

I suppose I am missing something fundamental regarding how the encoding works so any help on how to get the data with correct characters would be appreciated.

EDIT:
I also tried the node-iconv package in case the problem was in the package itself. I changed line 51 to:

var decoder = new Iconv_native('WINDOWS-1250', 'UTF-8');
var decoded = decoder.convert(body).toString();

but I still get the same results.

asked Nov 19, 2012 at 14:55 by aocenas; edited Nov 19, 2012 at 17:27

2 Answers

I'm not familiar with the iconv-lite package, but looking through its code, it looks like you'll need to use win1250 instead of windows1250.

The encoding names are looked up as keys in a hash.

Also, the readme uses 'win1251' rather than 'windows1251':

str = iconv.decode(buf, 'win1251');

I think you are converting a string, but you must convert the raw bytes! (If you are reading something from the web, you must read it as binary.)
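A minimal sketch of the web case, assuming the response arrives as Buffer chunks (which is what a Node stream delivers when no string encoding has been set on it). To stay dependency-free it uses Node's built-in TextDecoder in place of iconv-lite or node-iconv, which assumes a Node build with full ICU. The chunk contents and the function names are made up for illustration and are not taken from the asker's gist.

```javascript
// Hypothetical chunks, as a response stream might deliver them:
// the windows-1250 bytes for "čaj žaba".
const chunks = [
  Buffer.from([0xe8, 0x61, 0x6a, 0x20]), // "čaj "
  Buffer.from([0x9e, 0x61, 0x62, 0x61]), // "žaba"
];

// Wrong: appending a Buffer to a string decodes it as UTF-8 on the spot,
// so the non-ASCII windows-1250 bytes are replaced by U+FFFD before any
// converter gets to see them.
function collectAsString(parts) {
  let body = '';
  for (const part of parts) body += part;
  return body;
}

// Right: keep the raw bytes, join them, and decode once at the end.
function collectAndDecode(parts, charset) {
  const bytes = Buffer.concat(parts);
  return new TextDecoder(charset).decode(bytes);
}

const broken = collectAsString(chunks);
const decoded = collectAndDecode(chunks, 'windows-1250');

console.log(JSON.stringify(broken));
console.log(decoded);
```

Once the bytes have been turned into a string the wrong way, the original byte values are gone, so no later conversion can recover the characters. That would explain getting the same wrong output from two different conversion packages.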

Example of reading a win-1250 file from disk:

var fs = require('fs');
var Iconv = require('iconv').Iconv;

// Without an encoding option, fs returns the raw bytes as a Buffer.
var bytes = fs.readFileSync('myFile.txt');
// This is bad: var myBadString = fs.readFileSync('myFile.txt', { encoding: 'utf8' });

// Convert the raw windows-1250 bytes to UTF-8, then turn the result into a string.
var translated = new Iconv('CP1250', 'UTF-8').convert(bytes).toString();
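The same idea can be applied to a scraped page without hard-coding the charset, by reading the charset the page declares in its meta tag (as in the question) from the raw bytes before decoding. The following is only a sketch: sniffCharset and decodePage are hypothetical helper names, the sample page is invented, and Node's built-in TextDecoder stands in for the iconv packages (this assumes a Node build with full ICU).

```javascript
// Read the declared charset from the first bytes of a page.
// latin1 maps every byte to exactly one character, so this scan works on
// input in any single-byte encoding and leaves the ASCII markup intact.
function sniffCharset(bytes, fallback) {
  const head = bytes.subarray(0, 1024).toString('latin1');
  const match = /charset\s*=\s*["']?([\w-]+)/i.exec(head);
  return match ? match[1].toLowerCase() : fallback;
}

// Decode the raw bytes once, using the sniffed charset.
// TextDecoder throws a RangeError if it does not know the label.
function decodePage(bytes) {
  const charset = sniffCharset(bytes, 'utf-8');
  return new TextDecoder(charset).decode(bytes);
}

// Hypothetical page: ASCII markup plus the windows-1250 bytes for "šťastný".
const page = Buffer.concat([
  Buffer.from('<meta http-equiv="Content-Type" content="text/html; charset=windows-1250"><p>', 'latin1'),
  Buffer.from([0x9a, 0x9d, 0x61, 0x73, 0x74, 0x6e, 0xfd]),
  Buffer.from('</p>', 'latin1'),
]);

console.log(sniffCharset(page, 'utf-8'));
console.log(decodePage(page));
```

A real scraper would normally check the charset in the HTTP Content-Type response header first and only fall back to the meta tag; that step is left out here to keep the sketch short.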