Following system colour scheme Selected dark colour scheme Selected light colour scheme

Python Enhancement Proposals

PEP 223 – Change the Meaning of\xEscapes

Author:
Tim Peters <tim.peters at gmail >
Status:
Final
Type:
Standards Track
Created:
20-Aug-2000
Python-Version:
2.0
Post-History:
23-Aug-2000

Table of Contents

Abstract

Change\xescapes, in both 8-bit and Unicode strings, to consume exactly the two hex digits following. The proposal views this as correcting an original design flaw, leading to clearer expression in all flavors of string, a cleaner Unicode story, better compatibility with Perl regular expressions, and with minimal risk to existing code.

Syntax

The syntax of\xescapes, in all flavors of non-raw strings, becomes

\xhh

where h is a hex digit (0-9, a-f, A-F). The exact syntax in 1.5.2 is not clearly specified in the Reference Manual; it says

\xhh...

implying “two or more” hex digits, but one-digit forms are also accepted by the 1.5.2 compiler, and a plain\xis “expanded” to itself (i.e., a backslash followed by the letter x). It’s unclear whether the Reference Manual intended either of the 1-digit or 0-digit behaviors.

Semantics

In an 8-bit non-raw string,

\xij

expands to the character

chr(int(ij,16))

Note that this is the same as in 1.6 and before.

In a Unicode string,

\xij

acts the same as

\u00ij

i.e. it expands to the obvious Latin-1 character from the initial segment of the Unicode space.

An\xnot followed by at least two hex digits is a compile-time error, specificallyValueErrorin 8-bit strings, andUnicodeError(a subclass ofValueError) in Unicode strings. Note that if an\xis followed by more than two hex digits, only the first two are “consumed”. In 1.6 and before all but thelasttwo were silently ignored.

Example

In 1.5.2:

>>>"\x123465 "# same as "\x65"
'e'
>>>"\x65"
'e'
>>>"\x1"
'\001'
>>>"\x\x"
'\\x\\x'
>>>

In 2.0:

>>>"\x123465 "# \x12 -> \022, "3456" left alone
'\0223456'
>>>"\x65"
'e'
>>>"\x1"
[ValueError is raised]
>>>"\x\x"
[ValueError is raised]
>>>

History and Rationale

\xescapes were introduced in C as a way to specify variable-width character encodings. Exactly which encodings those were, and how many hex digits they required, was left up to each implementation. The language simply stated that\x“consumed”allhex digits following, and left the meaning up to each implementation. So, in effect,\xin C is a standard hook to supply platform-defined behavior.

Because Python explicitly aims at platform independence, the\xescape in Python (up to and including 1.6) has been treated the same way across all platforms: allexceptthe last two hex digits were silently ignored. So the only actual use for\xescapes in Python was to specify a single byte using hex notation.

Larry Wall appears to have realized that this was the only real use for \xescapes in a platform-independent language, as the proposed rule for Python 2.0 is in fact what Perl has done from the start (although you need to run in Perl -w mode to get warned about\xescapes with fewer than 2 hex digits following – it’s clearly more Pythonic to insist on 2 all the time).

When Unicode strings were introduced to Python,\xwas generalized so as to ignore all but the lastfourhex digits in Unicode strings. This caused a technical difficulty for the new regular expression engine: SRE tries very hard to allow mi xing 8-bit and Unicode patterns and strings in intuitive ways, and it no longer had any way to guess what, for example,r "\x123456"should mean as a pattern: is it asking to match the 8-bit character\x56or the Unicode character\u3456?

There are hacky ways to guess, but it doesn’t end there. The ISO C99 standard also introduces 8-digit\U12345678escapes to cover the entire ISO 10646 character space, and it’s also desired that Python 2 support that from the start. But then what are\xescapes supposed to mean? Do they ignore all but the lasteighthex digits then? And if less than 8 following in a Unicode string, all but the last 4? And if less than 4, all but the last 2?

This was getting messier by the minute, and the proposal cuts the Gordian knot by making\xsimpler instead of more complicated. Note that the 4-digit generalization to\xijklin Unicode strings was also redundant, because it meant exactly the same thing as\uijklin Unicode strings. It’s more Pythonic to have just one obvious way to specify a Unicode character via hex notation.

Development and Discussion

The proposal was worked out among Guido van Rossum, Fredrik Lundh and Tim Peters in email. It was subsequently explained and discussed on Python-Dev under subject “Go x yourself”[1],starting 2000-08-03. Response was overwhelmingly positive; no objections were raised.

Backward Compatibility

Changing the meaning of\xescapes does carry risk of breaking existing code, although no instances of incompatibility have yet been discovered. The risk is believed to be minimal.

Tim Peters verified that, except for pieces of the standard test suite deliberately provoking end cases, there are no instances of\xabcdef... with fewer or more than 2 hex digits following, in either the Python CVS development tree, or in assorted Python packages sitting on his machine.

It’s unlikely there are any with fewer than 2, because the Reference Manual implied they weren’t legal (although this is debatable!). If there are any with more than 2, Guido is ready to argue they were buggy anyway <0.9 wink>.

Guido reported that the O’Reilly Python booksalreadydocument that Python works the proposed way, likely due to their Perl editing heritage (as above, Perl worked (very close to) the proposed way from its start).

Finn Bock reported that what JPython does with\xescapes is unpredictable today. This proposal gives a clear meaning that can be consistently and easily implemented across all Python implementations.

Effects on Other Tools

Believed to be none. The candidates for breakage would mostly be parsing tools, but the author knows of none that worry about the internal structure of Python strings beyond the approximation “when there’s a backslash, swallow the next character”. Tim Peters checked Python -mode.el,the stdtokenize.pyandpyclbr.py,and the IDLE syntax coloring subsystem, and believes there’s no need to change any of them. Tools liketabnanny.pyandcheckappend.pyinherit their immunity fromtokenize.py.

Reference Implementation

The code changes are so simple that a separate patch will not be produced. Fredrik Lundh is writing the code, is an expert in the area, and will simply check the changes in before 2.0b1 is released.

BDFL Pronouncements

Yes,ValueError,notSyntaxError.“Problems with literal interpretations traditionally raise ‘runtime’ exceptions rather than syntax errors.”

References


Source:https://github / Python /peps/blob/main/peps/pep-0223.rst

Last modified:2023-09-09 17:39:29 GMT