We have a Node.js application which we recently moved from running on IIS 7 (via iisnode) to running on Linux (Elastic Beanstalk). Since we switched, we've been getting a lot of non-UTF-8 URLs sent to our application (mainly from crawlers), such as:
Bj%F6rk
which IIS was converting to Björk. This is now passed through to our application, and our web framework (Express) eventually calls down to:
decodeURIComponent('Bj%F6rk');
URIError: URI malformed
at decodeURIComponent (native)
at repl:1:1
at REPLServer.self.eval (repl.js:110:21)
at repl.js:249:20
at REPLServer.self.eval (repl.js:122:7)
at Interface.<anonymous> (repl.js:239:12)
at Interface.emit (events.js:95:17)
at Interface._onLine (readline.js:203:10)
at Interface._line (readline.js:532:8)
at Interface._ttyWrite (readline.js:761:14)
Is there a recommended safe way we can perform the same conversion as IIS before sending the URL string to Express?
Bearing in mind:
- We are receiving real requests to these badly encoded URLs, and there is a way to decode them using the deprecated unescape JavaScript function.
- The majority of the requests to these URLs are coming from Bing Bot, and we want to minimise any adverse effect on our search rankings.
- Should we really be doing this for all incoming URLs?
- Are there any security or performance implications we should be concerned about?
- Should we be concerned about unescape being removed in the near future?
- Is there a better / safer way to solve this problem? (Yes, we did read that MDN article linked to above.)
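For reference, the unescape behaviour mentioned above (a quick check in a Node REPL):

```javascript
// The deprecated global unescape decodes single-byte %XX escapes as
// Latin-1 code points, which is why it handles this URL where
// decodeURIComponent throws.
console.log(unescape('Bj%F6rk')); // 'Björk'
```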
3 Answers
Should we really be doing this for all incoming URLs?
No, you shouldn't. The request being made uses non-UTF-8 URI components. That shouldn't be your problem.
Are there any security or performance implications we should be concerned about?
The encoding of a URI component is not a security issue. Injection attempts via query string or path params are, but that's another subject. In terms of performance, every middleware makes your responses take a bit longer, but I wouldn't even worry about that. If you want to decode the URI yourself, just do it. It'll only take a few milliseconds.
Should we be concerned about unescape being removed in the near future?
Actually you should. unescape is deprecated. If you still want to use it, check that it exists first, i.e. 'unescape' in global. You can also use the built-in alternative, require('querystring').unescape(), which won't produce the same result in every case but won't throw a URIError. (Not recommended though.)
To minimise any adverse effect on search rankings:
Determine which status code your Express app returns in these cases. It could be 500 (INTERNAL SERVER ERROR), which will look bad, or 404 (NOT FOUND), which tells the crawler you don't have a result for the query (which may not be true).
In these cases, I suggest you override this by returning a client error such as 400 (BAD REQUEST) instead, since the origin of the problem is a malformed URI component being requested, which should be in UTF-8 but isn't. The crawler/bot should be concerned about that.
// middleware for responding with BAD REQUEST
app.use(function (err, req, res, next) {
  if (err instanceof URIError) {
    res.status(400).send();
  } else {
    next(err); // pass other errors on to the default error handler
  }
});
Above all, trying to return a result for a malformed URI has other side effects. First, you'll be allowing a bad request, which can't be good :). Secondly, it means you have a result for a bad URI; crawlers/bots will store it when they get a 200 OK response, and it will spread. Then you'll have to deal with more bad requests.
To conclude: don't decode via unescape. Express already tries to decode via the proper function, decodeURIComponent. If that fails, let it be.
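To see why the URIError check in the middleware matches, here is the behaviour of decodeURIComponent on both forms of the name:

```javascript
// The lone Latin-1 escape %F6 is not valid percent-encoded UTF-8, so
// decodeURIComponent throws a URIError; the two-byte UTF-8 sequence
// %C3%B6 decodes cleanly.
try {
  decodeURIComponent('Bj%F6rk');
} catch (e) {
  console.log(e instanceof URIError); // true
}
console.log(decodeURIComponent('Bj%C3%B6rk')); // 'Björk'
```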
Node.js's querystring module has safe implementations of the escape and unescape methods. They both use UTF-8 encoding. unescape first tries decodeURIComponent and, when that fails, falls back to a safe fast alternative implementation.
> querystring.escape('ö')
'%C3%B6'
> querystring.unescape('%C3%B6')
'ö'
But you have a Latin-1 encoded string (%F6 instead of %C3%B6), so querystring.unescape would give an unexpected result, but it wouldn't break your code:
> querystring.unescape('Bj%F6rk')
'Bj�rk'
You might be able to convert from Latin-1 to UTF-8 and get the right string using the iconv or iconv-lite package. But URL encoding should be in UTF-8, so I think it's safe to ignore otherwise-encoded strings and just use querystring.unescape.
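If you do want to recover the Latin-1 strings without pulling in iconv, here is a sketch using Node's built-in Buffer, which understands the latin1 encoding (decodeLatin1Component is a hypothetical helper name):

```javascript
// Try normal UTF-8 decoding first; on URIError, turn each %XX escape
// into its raw byte and reinterpret the whole byte sequence as Latin-1.
function decodeLatin1Component(str) {
  try {
    return decodeURIComponent(str);
  } catch (e) {
    if (!(e instanceof URIError)) throw e;
    const bytes = [];
    for (let i = 0; i < str.length; i++) {
      if (str[i] === '%' && /^[0-9a-fA-F]{2}$/.test(str.slice(i + 1, i + 3))) {
        bytes.push(parseInt(str.slice(i + 1, i + 3), 16));
        i += 2; // skip the two hex digits we just consumed
      } else {
        bytes.push(str.charCodeAt(i) & 0xff);
      }
    }
    return Buffer.from(bytes).toString('latin1');
  }
}

console.log(decodeLatin1Component('Bj%F6rk'));    // 'Björk' (Latin-1 input)
console.log(decodeLatin1Component('Bj%C3%B6rk')); // 'Björk' (UTF-8 input)
```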
In Express 4.7.x, you can set the query parser configuration to simple to use querystring.parse, which internally uses querystring.unescape.
app.set('query parser', 'simple') // or 'extended' to use 'qs' module
I recommend the Node.js decode-uri-charset package: https://www.npmjs.com/package/decode-uri-charset
var url_decode = require('decode-uri-charset');
console.log(url_decode('%C7%CF%C0%CC', 'euc-kr'))