python - Can I specify the replacement character in str([value], encoding='utf-8', errors='replace')?

Is it possible to specify the replacement character used by str(xxx,encoding='utf-8', errors='replace') to be something other than the diamond-question-mark character (�)?

I am attempting to fix up a GPS data filtering routine in Python for my GoPiGo robot.

The GPS module I am using returns NMEA "GPxxx" sentences as plain ASCII text (one 8-bit byte per character), each beginning with "$GPxxx", where "xxx" is a three-character code for the type of data in the sentence.

For example, the NMEA sentences arriving from the GPS over the serial port might look something like this:

$GPRMC,100905.00,A,5533.07171,N,03734.91789,E,2.657,,150325,,,A*76
$GPVTG,,T,,M,2.657,N,4.920,K,A*2A
$GPGGA,100905.00,5533.07171,N,03734.91789,E,1,04,2.81,183.7,M,13.4,M,,*58
$GPGSA,A,3,12,29,20,06,,,,,,,,,8.80,2.81,8.34*0A
$GPGSV,3,1,10,04,04,007,,05,08,138,,06,24,046,14,11,49,070,*77
$GPGSV,3,2,10,12,48,134,18,20,28,107,08,25,78,210,,28,38,290,11*7E
$GPGSV,3,3,10,29,49,251,14,31,24,315,*7E
$GPGLL,5533.07171,N,03734.91789,E,100905.00,A,A*69

Right now I read each line from the serial port, use str() to decode the raw bytes as UTF-8, and print the result. Sometimes the initial read contains garbage bytes that raise an exception, so I use errors='ignore' to prevent this:

self.raw_line = str(self.ser.readline().strip(), encoding='utf-8', errors='ignore')

The result is that when I start a serial communication session, some of the first characters read into the input stream are "garbage" characters. This appears to be a characteristic of the way Python reads the stream, since PuTTY on Windows and miniterm on Linux don't show these characters.

Reading GPS sensor for location . . . 
If you are not seeing coordinates appear, your sensor needs to be
outside to detect GPS satellites.
Reading GPS sensor for location . . . 
JSSH
ubbbbrb95)$GPRMC,131435.00,V,,,,,,,150325,,,N*7C
$GPVTG,,,,,,,,,N*30
$GPGGA,131435.00,,,,,0,00,99.99,,,,,,*67
$GPGSA,A,1,,,,,,,,,,,,,99.99,99.99,99.99*30
$GPGSV,3,1,12,05,36,054,,07,03,359,,13,10,076,,15,13,113,*75

...where JSSHubbbbrb95) represents a string of nonsensical characters that appear before the start of valid data.

What I want to do is replace invalid characters with "" instead of the diamond-question-mark.

I understand that I can filter them out. However, it would be much more convenient if I could replace them with "nothing" at the time they're read.
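
For reference, the filtering approach mentioned above could look like the following minimal sketch. It assumes each sentence arrives as one line of bytes (for example from a pyserial readline() call), drops everything before the first "$", and then checks the standard NMEA checksum (the XOR of all characters between "$" and "*"). The function names nmea_checksum_ok and extract_sentence are hypothetical and not part of any library.

```python
def nmea_checksum_ok(sentence: str) -> bool:
    """Return True if the two hex digits after '*' equal the XOR of the payload."""
    if not sentence.startswith("$") or "*" not in sentence:
        return False
    # Payload is everything between '$' and '*'; 'given' is the checksum text.
    payload, _, given = sentence[1:].partition("*")
    calculated = 0
    for ch in payload:
        calculated ^= ord(ch)
    try:
        return calculated == int(given[:2], 16)
    except ValueError:
        # Missing or non-hexadecimal checksum digits.
        return False


def extract_sentence(raw: bytes):
    """Decode one serial line and discard any garbage before the first '$'.

    Returns the sentence as a str, or None if no valid sentence is found.
    """
    # Non-ASCII bytes are dropped; garbage that is valid ASCII survives decoding
    # and is removed by slicing from the first '$' instead.
    text = raw.decode("ascii", errors="ignore").strip()
    start = text.find("$")
    if start == -1:
        return None
    sentence = text[start:]
    return sentence if nmea_checksum_ok(sentence) else None


# Illustrative input: garbage prefix followed by an empty VTG sentence.
line = b"ubbbbrb95)$GPVTG" + b"," * 9 + b"N*30\r\n"
print(extract_sentence(line))
```

Slicing from the first "$" handles garbage that is perfectly valid ASCII, which no decoding error handler would ever touch.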

Is it possible to specify a different replacement character when using errors='replace'?

asked Mar 15 at 14:11 by Jim JR Harris; edited Mar 15 at 14:24 by jonrsharpe
  • Can you show the actual bytes received? – snakecharmerb Commented Mar 15 at 14:54
  • errors='ignore' would replace the invalid characters with nothing, although garbage characters that are valid ASCII won't be ignored (ASCII is a subset of UTF-8). None of the characters in your example would be replaced at all, no matter the error handler, since they are valid in UTF-8. – Mark Tolonen Commented Mar 16 at 3:47
  • But it sounds like an x-y problem. Perhaps show your Python code that returns the garbage. – Mark Tolonen Commented Mar 16 at 3:50
  • Possibly, but I don't think so, since similar constructs (in other languages?) allow the user to select the replacement character. Rather than an x-y problem, it's more a case of "if I can do 'x', it will be really easy; if not, plan B is 'y'". It just makes sense to explore a potentially easy solution first so you don't end up reinventing the wheel. – Jim JR Harris Commented Mar 19 at 17:40
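
The point raised in the comments about error handlers can be checked with a short sketch. The byte strings below are made-up examples, not data captured from the GPS module: one contains garbage that is still valid ASCII, the other contains bytes (0xFF, 0xFE) that can never appear in UTF-8.

```python
# Garbage that is valid ASCII, and therefore valid UTF-8.
garbage_ascii = b"ubbbbrb95)$GPVTG"
# Bytes that are invalid in UTF-8, followed by the start of a sentence.
bad_bytes = b"\xff\xfe$GPVTG"

# No decoding error occurs here, so the error handler is never consulted.
ignored_ascii = str(garbage_ascii, encoding="utf-8", errors="ignore")

# 'ignore' drops each invalid byte; 'replace' substitutes U+FFFD for it.
ignored_bad = str(bad_bytes, encoding="utf-8", errors="ignore")
replaced_bad = str(bad_bytes, encoding="utf-8", errors="replace")

print(repr(ignored_ascii))  # garbage prefix is still present
print(repr(ignored_bad))    # invalid bytes removed
print(repr(replaced_bad))   # invalid bytes shown as U+FFFD
```

So errors='ignore' already replaces undecodable bytes with "nothing", but it cannot remove garbage that decodes successfully.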

1 Answer

I don't think errors='replace' lets you choose the replacement character. However, you can register a custom error handler that replaces invalid bytes with a string of your choice, including the empty string.

You're going to need the standard-library codecs module to do that:

import codecs

def remove_invalid_bytes(error):
    # error is a UnicodeDecodeError
    # return a tuple: (replacement string, position to continue)
    return ("", error.end)

# register the custom error handler
codecs.register_error("remove", remove_invalid_bytes)

# now use the custom error handler when decoding
raw_bytes = b"some invalid data: \xff\xfe$GPRMC,..."
decoded_string = str(raw_bytes, encoding="utf-8", errors="remove")
print(decoded_string)

This should print: some invalid data: $GPRMC,...
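
The same mechanism can be generalised so that the replacement string is a parameter. The following is a minimal sketch, not a built-in feature: register_replacement and the handler names "replace_with_qmark" and "replace_with_nothing" are hypothetical names chosen for illustration. Only codecs.register_error and the start/end attributes of UnicodeDecodeError come from the standard library.

```python
import codecs


def register_replacement(name: str, replacement: str) -> None:
    """Register a decode error handler that substitutes `replacement`."""

    def handler(error):
        # This sketch only handles decoding; re-raise anything else.
        if not isinstance(error, UnicodeDecodeError):
            raise error
        # Resume decoding right after the undecodable byte sequence.
        return (replacement, error.end)

    codecs.register_error(name, handler)


# Hypothetical handler names; any name not already taken would do.
register_replacement("replace_with_qmark", "?")
register_replacement("replace_with_nothing", "")

raw_bytes = b"\xff\xfe$GPRMC"
with_qmark = raw_bytes.decode("utf-8", errors="replace_with_qmark")
with_nothing = raw_bytes.decode("utf-8", errors="replace_with_nothing")

print(with_qmark)
print(with_nothing)
```

The handler runs once per decoding error rather than once per byte, so a multi-byte invalid sequence may produce a single replacement. As with the built-in handlers, bytes that decode successfully are never passed to it.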
