python - Why doesn't my function that converts a string of digits to a list of bytes work?

I have written a simple function that converts a string consisting only of digits ('0123456789') to a list of bytes. More specifically, it greedily splits the string into the longest possible runs of digits whose integer value is less than 256.

For example, given the ten decimal digits in order ('0123456789'), the output should be (0, 123, 45, 67, 89).

Now I wrote this function very quickly:

from typing import List


DIGITS = {str(i): i for i in range(10)}

def get_bytes(s: str) -> List[int]:
    byte_list = []
    chunk = 0
    for c in f'{s}0':
        if (d := DIGITS.get(c)) is None:
            raise ValueError(f"{c} is not a digit")

        if (new := chunk * 10 + d) < 256:
            chunk = new
        else:
            byte_list.append(chunk)
            chunk = d

    return byte_list[not byte_list[0]:]

In [143]: get_bytes('0000')
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
Cell In[143], line 1
----> 1 get_bytes('0000')

Cell In[142], line 14, in get_bytes(s)
     11         byte_list.append(chunk)
     12         chunk = d
---> 14 return byte_list[not byte_list[0]:]

IndexError: list index out of range

In [144]: get_bytes('100100100')
Out[144]: [100, 100, 100]

In [145]: get_bytes('12301230123')
Out[145]: [123, 123, 123]

I tried to fix the code but couldn't. There is only one problem remaining, and it is exactly the problem this question is about.

In the first example, the expected output is [0, 0, 0, 0]; in the third example, the correct output should be [123, 0, 123, 0, 123]. How can I fix this?

As for 0, I want every isolated 0 to be its own number in the output. In short, I want to make sure that ''.join(str(i) for i in output) == original.
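The round-trip requirement can be written down as a small check. This is only a sketch: check_round_trip is a hypothetical helper name that does not appear in the original code, and it additionally assumes that every value must fit in one byte.

```python
from typing import List


def check_round_trip(original: str, output: List[int]) -> bool:
    """Return True if output is a valid split of original into byte values."""
    # Every value must fit in one byte (0..255).
    if any(not 0 <= value < 256 for value in output):
        return False
    # Concatenating the decimal representations must reproduce the input.
    return ''.join(str(value) for value in output) == original


print(check_round_trip('12301230123', [123, 0, 123, 0, 123]))  # True
print(check_round_trip('12301230123', [123, 123, 123]))        # False
```

The second call fails because the two isolated zeros are missing from the output, which is exactly the symptom shown in Out[145].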

According to my tests, the code from the accepted answer performs better than the regex-based approach, so I accepted it:

In [384]: get_bytes1(string) == get_bytes2(string)
Out[384]: True

In [385]: %timeit get_bytes1(string)
179 ms ± 588 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [386]: %timeit get_bytes2(string)
161 ms ± 748 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [387]: len(string)
Out[387]: 1000001

get_bytes2 is from the accepted answer; get_bytes1 is the regex-based approach. string is the first 1000000 decimal places of π with a '3' in front.

Asked Mar 15 at 22:52, edited Mar 16 at 14:10, by Ξένη Γήινος

3 Answers

I did one simple thing to fix the code. In the original code, zeros are skipped because after an isolated zero chunk is 0, so chunk * 10 + d is just the next digit: the zero is absorbed into the next chunk and never appended.

I added a condition that catches this single case:
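To see where the zero disappears, the loop from the question can be replayed with each step recorded. This is an illustrative sketch only: trace_original is a hypothetical name, and it uses int(c) instead of the DIGITS lookup, so it assumes the input contains only ASCII digits.

```python
def trace_original(s: str):
    """Replay the loop from the question and record what happens per character."""
    byte_list = []
    steps = []
    chunk = 0
    for c in f'{s}0':  # the question appends a sentinel '0' to flush the last chunk
        d = int(c)
        new = chunk * 10 + d
        if new < 256:
            chunk = new  # the digit is merged into the current chunk
            steps.append((c, new, 'extend'))
        else:
            byte_list.append(chunk)  # the current chunk is emitted
            chunk = d
            steps.append((c, new, 'flush'))
    return byte_list, steps


byte_list, steps = trace_original('1230')
for c, new, action in steps:
    print(c, new, action)
print(byte_list)  # [123]
```

The final '0' of '1230' triggers the flush of 123 and becomes the new chunk, but the next character only computes 0 * 10 + d, so that zero is never emitted. For '0000' no flush happens at all, byte_list stays empty, and byte_list[0] raises the IndexError shown in the question.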

from typing import List


DIGITS = {str(i): i for i in range(10)}


def get_bytes(s: str) -> List[int]:
    byte_list = []
    chunk = 0

    for c in s:
        if (d := DIGITS.get(c)) is None:
            raise ValueError(f"{c} is not a digit")

        if (new := chunk * 10 + d) < 256:
            chunk = new
        else:
            byte_list.append(chunk)
            chunk = d

        if chunk == d == 0:
            byte_list.append(chunk)

    if chunk:
        byte_list.append(chunk)

    return byte_list

In [159]: get_bytes('0000')
Out[159]: [0, 0, 0, 0]

In [160]: get_bytes('1234567890123456789')
Out[160]: [123, 45, 67, 89, 0, 123, 45, 67, 89]

In [161]: get_bytes('10010010000')
Out[161]: [100, 100, 100, 0, 0]

Of course, the foolproof approach is to use re, which is also faster, though only slightly:

import re


BYTES = re.compile(r'1\d\d|2[0-4]\d|25[0-5]|[1-9]\d?|0')


def get_bytes1(s: str) -> List[int]:
    return list(map(int, BYTES.findall(s)))

In [282]: %timeit get_bytes(string)
198 ms ± 6.54 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [283]: %timeit get_bytes1(string)
173 ms ± 503 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [284]: len(string)
Out[284]: 1000001

I have fixed the regex. It now works as intended, with no edge-case failures.
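One way to gain confidence in the pattern is to compare it against a deliberately simple greedy splitter on random input. The sketch below is not part of the original answer: split_regex and split_reference are hypothetical names, and the reference assumes the input contains only ASCII digits.

```python
import random
import re
from typing import List

BYTES = re.compile(r'1\d\d|2[0-4]\d|25[0-5]|[1-9]\d?|0')


def split_regex(s: str) -> List[int]:
    """Split with the pattern from the answer."""
    return [int(m) for m in BYTES.findall(s)]


def split_reference(s: str) -> List[int]:
    """Greedy reference: longest piece (3, 2, then 1 digits) with no leading zero and value < 256."""
    out = []
    i = 0
    while i < len(s):
        for width in (3, 2, 1):
            piece = s[i:i + width]
            if len(piece) < width:
                continue  # not enough characters left for this width
            if width > 1 and piece[0] == '0':
                continue  # a leading zero would break the round trip
            if int(piece) < 256:
                out.append(int(piece))
                i += width
                break  # a single digit always qualifies, so the loop advances
    return out


random.seed(0)  # fixed seed so the run is repeatable
for _ in range(1000):
    s = ''.join(random.choices('0123456789', k=random.randint(0, 30)))
    got = split_regex(s)
    assert got == split_reference(s), s
    assert ''.join(map(str, got)) == s, s
print('regex and reference agree on 1000 random strings')
```

The order of the alternatives matters: re tries them left to right at each position, so the three-digit cases come first, the one-or-two-digit case next, and the lone 0 last.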

This may not be as concise as it could be but it does work for the test cases presented in the question.

DIGITS = {str(d): d for d in range(10)}

def get_bytes(s: str) -> list[int]:
    result = []
    for i, c in enumerate(s):
        if c == "0":
            result.append(0)
        else:
            break
    else:
        return result
    a = 0
    for c in s[i:]:
        x = DIGITS[c]
        t = a * 10 + x
        if t > 255:
            result.append(a)
            if x == 0:
                result.append(0)
            a = x
        elif x == 0 and a == 0:
            result.append(0)
        else:
            a = t
    return result if a == 0 else result + [a]
    
samples = [
    "0000",
    "12301230123",
    "100100100",
    "1234567890123456789",
    "10010010000"
]
for sample in samples:
    print(sample, get_bytes(sample))

Output:

0000 [0, 0, 0, 0]
12301230123 [123, 0, 123, 0, 123]
100100100 [100, 100, 100]
1234567890123456789 [123, 45, 67, 89, 0, 123, 45, 67, 89]
10010010000 [100, 100, 100, 0, 0]

def get_bytes2(digits):
    # Parameter renamed from `input` so the builtin is not shadowed.
    word = digits[0]
    listOut = []
    for i in range(1, len(digits)):

        letter = digits[i]

        # Extend the current chunk only if it is not a lone "0"
        # and the extended value still fits in a byte.
        if word != "0" and int(word + letter) < 256:
            word += letter
        else:
            listOut.append(int(word))
            word = letter

    if word != "":
        listOut.append(int(word))
    return listOut

# Loop variable renamed from `str` so that str() keeps working afterwards.
samples = ["1234567890", "0123456789", "12301230123", "100100100", "0000"]
for sample in samples:
    print(get_bytes2(sample))

"""
Outputs obtained:
[123, 45, 67, 89, 0]
[0, 123, 45, 67, 89]
[123, 0, 123, 0, 123]
[100, 100, 100]
[0, 0, 0, 0]

"""

First, we initialize word with the first digit of the string; then we enter the for loop, which iterates through the string starting from the second digit.

In the if, we check that word is not "0" and that the numeric value of word concatenated with letter is less than 256. If both conditions hold, we append letter to word; otherwise we move on to the else.

In the else, we append the numeric value of word to listOut and assign letter to word.

Finally, after the for loop, if word is not empty, we append the numeric value of its contents to listOut. Since word always holds at least one digit at this point, the last chunk is always appended.
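As written, the function indexes the first character unconditionally, so an empty string raises IndexError, and a non-digit character only fails later inside int(). A guarded variant might look like the sketch below. The name get_bytes2_checked and the choices to return [] for empty input and raise ValueError for non-digits are assumptions, not part of the answer.

```python
from typing import List


def get_bytes2_checked(digits: str) -> List[int]:
    """Same greedy split as get_bytes2, with explicit input checks."""
    if not digits:
        return []  # assumption: an empty string splits into no bytes
    # isascii() excludes Unicode digits such as '²' that int() cannot parse.
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError(f"{digits!r} contains a non-digit character")

    word = digits[0]
    out = []
    for letter in digits[1:]:
        # Extend the chunk unless it is a lone "0" or would reach 256.
        if word != "0" and int(word + letter) < 256:
            word += letter
        else:
            out.append(int(word))
            word = letter
    out.append(int(word))  # word is never empty here
    return out


print(get_bytes2_checked("12301230123"))  # [123, 0, 123, 0, 123]
print(get_bytes2_checked(""))             # []
```

The up-front validation costs one extra pass over the string, so the timings quoted in the question do not apply to this variant.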
