Ben Kallus's cool website

Do not mix bytes and str!

You have the unfortunate job of taking an untyped Python codebase and adding type annotations to it. Most of this is pretty mechanical and easy, but then you run into this:

def parse_http_header_value(s):
    s = s.strip()
    if is_valid(s):
        return s
    raise ReallyBadException("u r bad")

some_str: str
g: Callable[[str], str]
g(parse_http_header_value(some_str))

some_bytes: bytes
h: Callable[[bytes], bytes]
h(parse_http_header_value(some_bytes))

You can't annotate parse_http_header_value as (str -> str) because it can operate on bytes, and you can't annotate it as (bytes -> bytes) because it can operate on str. You can't even annotate it as (str | bytes -> str | bytes) because that doesn't express that the return type is the same as the argument type, so the calls to g and h will not type check. You're sure there's a solution using TypeVars, but you don't really understand TypeVars and don't want to make this complicated.
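
For reference, the TypeVar route you are avoiding would probably look something like the sketch below, which uses typing.AnyStr (a TypeVar constrained to str and bytes). The bodies of is_valid and ReallyBadException are hypothetical stand-ins, since the real ones are not shown here. Note that this only satisfies the type checker; it says nothing about whether the function behaves the same way on both types.

```python
from typing import AnyStr


class ReallyBadException(Exception):
    """Hypothetical stand-in for the real exception type."""


def is_valid(s: AnyStr) -> bool:
    # Hypothetical placeholder check: the real validation is not shown.
    return len(s) > 0


def parse_http_header_value(s: AnyStr) -> AnyStr:
    # AnyStr ties the return type to the argument type:
    # str in, str out; bytes in, bytes out.
    s = s.strip()
    if is_valid(s):
        return s
    raise ReallyBadException("u r bad")
```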

Then, you have a realization. A bytes object is just a sequence of integers in [0, 0x100). A str object is just a sequence of integers (called "code points") in [0, 0x110000). You can just convert the bytes object into the str object representing exactly the same sequence of integers, pass that to the str version of the function, then convert the result back to bytes in exactly the same way. You know how to do this; just use bytes.decode and str.encode to convert the bytes object to and from str. You split parse_http_header_value into two functions, one (str -> str) and the other (bytes -> bytes):

def parse_http_header_value_for_str(s: str) -> str:
    s = s.strip()
    if is_valid(s):
        return s
    raise ReallyBadException("u r bad")

def parse_http_header_value_for_bytes(s: bytes) -> bytes:
    return parse_http_header_value_for_str(s.decode()).encode()

You replace each call to parse_http_header_value with a call to either parse_http_header_value_for_str or parse_http_header_value_for_bytes, and a test fails:

Traceback (most recent call last):
  File "/the/path/to/the/tests.py", line 7, in <module>
    assert parse_http_header_value_for_bytes(b"\xff") == b"\xff"
           ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^
  File "/the/path/to/the/tests.py", line 5, in parse_http_header_value_for_bytes
    return parse_http_header_value_for_str(s.decode()).encode()
                                           ~~~~~~~~^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

This is just regular Unicode bullshit, though. You know how to solve this; just specify the "latin1" encoding during the conversion to ensure that each byte is converted to the code point of the same value:

def parse_http_header_value_for_bytes(s: bytes) -> bytes:
    return parse_http_header_value_for_str(s.decode("latin1")).encode("latin1")
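
If you want to convince yourself that latin1 really does map each byte to the code point of the same value, a quick sketch like this one checks all 256 byte values at once. The variable names are only for illustration.

```python
# Every possible byte value, 0x00 through 0xff.
all_bytes = bytes(range(0x100))

# latin1 maps byte N to code point N, so decoding cannot fail.
as_str = all_bytes.decode("latin1")
same_integers = [ord(c) for c in as_str] == list(all_bytes)

# Encoding the result gives back exactly the original bytes.
round_trips = as_str.encode("latin1") == all_bytes

# UTF-8, the default for bytes.decode, rejects a lone 0xff byte.
try:
    b"\xff".decode()
    utf8_accepts_ff = True
except UnicodeDecodeError:
    utf8_accepts_ff = False

print(same_integers, round_trips, utf8_accepts_ff)
```

So the conversion itself is lossless. Whatever goes wrong next is not the codec's fault.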

You re-run the tests, and another test fails:

Traceback (most recent call last):
  File "/the/path/to/the/tests.py", line 9, in <module>
    assert parse_http_header_value_for_bytes(b"\x85\xa0") == b"\x85\xa0"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

You open a REPL and this is what you discover:

>>> b"\xa0\x85".strip()
b'\xa0\x85'
>>> "\xa0\x85".strip()
''

More Unicode bullshit! Apparently str.strip removes code points 0xa0 and 0x85, but bytes.strip doesn't remove bytes 0xa0 and 0x85.

Weird.

You know that these tests were written by pedantic fools; no one will ever give you bytes outside of the ASCII range. You don't care about this stupid edge case, and just want to move on. You delete the test and hope that no one else notices.

Another test fails:

Traceback (most recent call last):
  File "/the/path/to/the/tests.py", line 7, in <module>
    assert parse_http_header_value_for_bytes(b"\x1c\x1d\x1e\x1f") == b"\x1c\x1d\x1e\x1f"
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
AssertionError

Back to the REPL :(

>>> b"\x1c\x1d\x1e\x1f".strip()
b'\x1c\x1d\x1e\x1f'
>>> "\x1c\x1d\x1e\x1f".strip()
''

Okay, so even in the ASCII range, str.strip and bytes.strip do not do the same thing. You have learned something new, and you are disappointed. Even if you could satisfy the type checker, the problem would not be solved. The system almost certainly contains unintended behavior due to these discrepancies. This will not be easy to fix.
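
To see the whole discrepancy in the single-byte range rather than finding it one failing test at a time, a sketch like the following compares what each strip method removes for every value in [0, 0x100). The list names are made up for this example.

```python
# Byte values that bytes.strip() removes when the byte stands alone.
stripped_as_bytes = [i for i in range(0x100) if bytes([i]).strip() == b""]

# The same values, viewed as code points, that str.strip() removes.
stripped_as_str = [i for i in range(0x100) if chr(i).strip() == ""]

# Values that str.strip() removes but bytes.strip() keeps.
only_str = [hex(i) for i in stripped_as_str if i not in stripped_as_bytes]
print(only_str)
```

On CPython this should print the six values from the failing tests: 0x1c through 0x1f, plus 0x85 and 0xa0. And that is only strip; other str and bytes methods would each need their own check.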

Ben Kallus

This post was brought to you by the following bugs: