Percent-encoding

URL encoding, officially known as percent-encoding, is a method to encode arbitrary data in a uniform resource identifier (URI) using only the US-ASCII characters legal within a URI. Although it is known as URL encoding, it is also used more generally within the main Uniform Resource Identifier (URI) set, which includes both Uniform Resource Locator (URL) and Uniform Resource Name (URN). Consequently, it is also used in the preparation of data of the application/x-www-form-urlencoded media type, as is often used in the submission of HTML form data in HTTP requests.

Percent-encoding in a URI

Types of URI characters

The characters allowed in a URI are either reserved or unreserved (or a percent character as part of a percent-encoding). Reserved characters are those characters that sometimes have special meaning. For example, forward slash characters are used to separate different parts of a URL (or more generally, a URI). Unreserved characters have no such meanings. Using percent-encoding, reserved characters are represented using special character sequences. The sets of reserved and unreserved characters and the circumstances under which certain reserved characters have special meaning have changed slightly with each revision of specifications that govern URIs and URI schemes.

RFC 3986 section 2.2 Reserved Characters (January 2005)
`!`	`#`	`$`	`&`	`'`	`(`	`)`	`*`	`+`	`,`	`/`	`:`	`;`	`=`	`?`	`@`	`[`	`]`

RFC 3986 section 2.3 Unreserved Characters (January 2005)
`A`	`B`	`C`	`D`	`E`	`F`	`G`	`H`	`I`	`J`	`K`	`L`	`M`	`N`	`O`	`P`	`Q`	`R`	`S`	`T`	`U`	`V`	`W`	`X`	`Y`	`Z`
`a`	`b`	`c`	`d`	`e`	`f`	`g`	`h`	`i`	`j`	`k`	`l`	`m`	`n`	`o`	`p`	`q`	`r`	`s`	`t`	`u`	`v`	`w`	`x`	`y`	`z`
`0`	`1`	`2`	`3`	`4`	`5`	`6`	`7`	`8`	`9`	`-`	`.`	`_`	`~`

Other characters in a URI must be percent-encoded.
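The classification above can be sketched in a few lines of Python. This is only an illustration: the names RESERVED, UNRESERVED, and classify are hypothetical and do not come from any specification or library; the character sets are copied from the two tables above.

```python
import string

# Character sets as listed in RFC 3986 sections 2.2 and 2.3 (see tables above).
RESERVED = set("!#$&'()*+,/:;=?@[]")
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")


def classify(ch: str) -> str:
    """Return the URI character class of a single character (hypothetical helper)."""
    if ch in UNRESERVED:
        return "unreserved"
    if ch in RESERVED:
        return "reserved"
    if ch == "%":
        return "percent"
    # Anything else (space, quotes, non-ASCII, ...) must be percent-encoded.
    return "must-encode"


print([classify(c) for c in "a/~ %é"])
# ['unreserved', 'reserved', 'unreserved', 'must-encode', 'percent', 'must-encode']
```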

Reserved characters

When a character from the reserved set (a "reserved character") has a special meaning (a "reserved purpose") in a certain context, and a URI scheme says that it is necessary to use that character for some other purpose, then the character must be percent-encoded. Percent-encoding a reserved character involves converting the character to its corresponding byte value in ASCII and then representing that value as a pair of hexadecimal digits (if there is a single hex digit, a leading zero is added). The digits, preceded by a percent sign (%) as an escape character, are then used in the URI in place of the reserved character. (A non-ASCII character is typically converted to its byte sequence in UTF-8, and then each byte value is represented as above.)

The reserved character /, for example, if used in the "path" component of a URI, has the special meaning of being a delimiter between path segments. If, according to a given URI scheme, / needs to be in a path segment, then the three characters %2F or %2f must be used in the segment instead of a raw /.

Reserved characters after percent-encoding
`!`	`#`	`$`	`&`	`'`	`(`	`)`	`*`	`+`	`,`	`/`	`:`	`;`	`=`	`?`	`@`	`[`	`]`
`%21`	`%23`	`%24`	`%26`	`%27`	`%28`	`%29`	`%2A`	`%2B`	`%2C`	`%2F`	`%3A`	`%3B`	`%3D`	`%3F`	`%40`	`%5B`	`%5D`
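The encoding procedure described in this section (convert the character to bytes, then write each byte as a percent sign followed by two hexadecimal digits) can be sketched as follows. The function name percent_encode_char is hypothetical, and the sketch assumes UTF-8 for the character-to-byte step, which coincides with ASCII for the reserved characters.

```python
def percent_encode_char(ch: str) -> str:
    """Percent-encode one character via its UTF-8 bytes (hypothetical helper)."""
    # :02X pads single hex digits with a leading zero and uses uppercase.
    return "".join(f"%{byte:02X}" for byte in ch.encode("utf-8"))


RESERVED = "!#$&'()*+,/:;=?@[]"
table = {ch: percent_encode_char(ch) for ch in RESERVED}

print(table["/"])                 # %2F
print(percent_encode_char("\t"))  # %09
print(percent_encode_char("€"))   # %E2%82%AC
```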

Reserved characters that have no reserved purpose in a particular context may also be percent-encoded but are not semantically different from those that are not.

In the "query" component of a URI (the part after a ? character), for example, / is still considered a reserved character but it normally has no reserved purpose, unless a particular URI scheme says otherwise. The character does not need to be percent-encoded when it has no reserved purpose.

URIs that differ only by whether a reserved character is percent-encoded or appears literally are normally considered not equivalent (that is, not denoting the same resource) unless it can be determined that the reserved characters in question have no reserved purpose. This determination is dependent upon the rules established for reserved characters by individual URI schemes.
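A small Python sketch shows why an encoded and a literal / in a path are not interchangeable. The helper path_segments is hypothetical; it assumes a consumer that splits the path on literal slashes before decoding each segment with the standard library's urllib.parse.unquote.

```python
from urllib.parse import unquote


def path_segments(path: str) -> list[str]:
    """Split on literal "/" first, then decode each segment (hypothetical helper)."""
    return [unquote(segment) for segment in path.split("/")]


print(path_segments("a%2Fb/c"))  # ['a/b', 'c']
print(path_segments("a/b/c"))    # ['a', 'b', 'c']
```

Decoding before splitting would erase the distinction, which is why consumers that care about path structure typically split first.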

Unreserved characters

Characters from the unreserved set never need to be percent-encoded.

URIs that differ only by whether an unreserved character is percent-encoded or appears literally are equivalent by definition, but URI processors, in practice, may not always recognize this equivalence. For example, URI consumers should not treat %41 differently from A or %7E differently from ~, but some do. For maximal interoperability, URI producers are discouraged from percent-encoding unreserved characters.
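One way a consumer might recognize this equivalence is to decode percent-encoded unreserved characters before comparing URIs. The sketch below is an assumption-laden illustration, not a complete normalizer: normalize_unreserved is a hypothetical name, and the uppercasing of the remaining hex digits is an extra step following the case normalization described in RFC 3986 section 6.2.2.1.

```python
import re
import string

UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_unreserved(uri: str) -> str:
    """Decode %XX sequences that stand for unreserved characters (hypothetical helper)."""
    def replace(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            return ch                   # %41 -> A, %7E -> ~
        return match.group(0).upper()   # keep other escapes, uppercase the hex digits
    return _PCT.sub(replace, uri)


print(normalize_unreserved("%41%7e%2f"))  # A~%2F
```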

Percent character

Because the percent character (%) serves to indicate percent-encoded octets, it must itself be percent-encoded as %25 to be used as data within a URI.
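The effect can be illustrated with Python's urllib.parse module. The example strings are made up; the second half shows, as an aside, that decoding the same data twice changes its meaning once a literal percent sign is involved.

```python
from urllib.parse import quote, unquote

encoded = quote("100%")
print(encoded)            # 100%25
print(unquote(encoded))   # 100%

# Decoding twice is not harmless: "%2541" is the data "%41", not "A".
once = unquote("%2541")
twice = unquote(once)
print(once, twice)        # %41 A
```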

Arbitrary data

Most URI schemes involve the representation of arbitrary data, such as an IP address or file system path, as components of a URI. URI scheme specifications should, but often do not, provide an explicit mapping between URI characters and all possible data values being represented by those characters.

Binary data

Since the publication of RFC 1738 in 1994, it has been specified that schemes that provide for the representation of binary data in a URI must divide the data into 8-bit bytes and percent-encode each byte in the same manner as above.[1] Byte value 0x0F, for example, should be represented by %0F, but byte value 0x41 can be represented by A or %41. The use of unencoded characters for alphanumeric and other unreserved characters is typically preferred, as it results in shorter URLs.
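The two byte values mentioned above can be run through Python's urllib.parse.quote_from_bytes as a quick check. This is a minimal sketch; the variable names are illustrative only.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes

data = bytes([0x0F, 0x41])

encoded = quote_from_bytes(data)
print(encoded)                      # %0FA  (0x41 is left as the unreserved letter A)

# Both spellings of 0x41 decode to the same bytes.
print(unquote_to_bytes("%0FA") == unquote_to_bytes("%0F%41") == data)  # True
```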

Character data

The procedure for percent-encoding binary data has often been extrapolated, sometimes inappropriately or without being fully specified, to apply to character-based data. In the World Wide Web's formative years, when dealing with data characters in the ASCII repertoire and using their corresponding bytes in ASCII as the basis for determining percent-encoded sequences, this practice was relatively harmless; it was just assumed that characters and bytes mapped one-to-one and were interchangeable. The need to represent characters outside the ASCII range, however, grew quickly, and URI schemes and protocols often failed to provide standard rules for preparing character data for inclusion in a URI. Web applications consequently began using different multi-byte, stateful, and other non-ASCII-compatible encodings as the basis for percent-encoding, leading to ambiguities and difficulty interpreting URIs reliably.

For example, many URI schemes and protocols based on RFCs 1738 and 2396 presume that the data characters will be converted to bytes according to some unspecified character encoding before being represented in a URI by unreserved characters or percent-encoded bytes. If the scheme does not allow the URI to provide a hint as to what encoding was used, or if the encoding conflicts with the use of ASCII to percent-encode reserved and unreserved characters, then the URI cannot be reliably interpreted. Some schemes fail to account for encoding at all and instead just suggest that data characters map directly to URI characters, which leaves it up to implementations to decide whether and how to percent-encode data characters that are in neither the reserved nor unreserved sets.

Common characters after percent-encoding (ASCII or UTF-8 based)
`␣`	`"`	`%`	`-`	`.`	`<`	`>`	`\`	`^`	`_`	`` ` ``	`{`	`|`	`}`	`~`	`£`	`€`
`%20`	`%22`	`%25`	`%2D`	`%2E`	`%3C`	`%3E`	`%5C`	`%5E`	`%5F`	`%60`	`%7B`	`%7C`	`%7D`	`%7E`	`%C2%A3`	`%E2%82%AC`
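The ambiguity described above is easy to reproduce. The following sketch uses Python's urllib.parse with Latin-1 as an arbitrary example of a legacy encoding; the choice of Latin-1 and of the pound sign is an assumption made for illustration.

```python
from urllib.parse import quote, unquote

utf8_form = quote("£")                        # UTF-8 is the default encoding
latin1_form = quote("£", encoding="latin-1")  # a legacy single-byte encoding

print(utf8_form)    # %C2%A3
print(latin1_form)  # %A3

# A consumer that guesses the wrong encoding cannot recover the character:
print(unquote(latin1_form, encoding="latin-1"))  # £
print(unquote(latin1_form) == "\ufffd")          # True (U+FFFD replacement character)
```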

Arbitrary character data is sometimes percent-encoded and used in non-URI situations, such as for password-obfuscation programs or other system-specific translation protocols.

Current standard

The generic URI syntax recommends that new URI schemes that provide for the representation of character data in a URI should, in effect, represent characters from the unreserved set without translation and should convert all other characters to bytes according to UTF-8, and then percent-encode those values. This suggestion was introduced in January 2005 with the publication of RFC 3986. URI schemes introduced before this date are not affected.
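In Python, this recommendation roughly corresponds to calling urllib.parse.quote with an empty safe set. The sketch assumes Python 3.7 or later, where quote leaves exactly the RFC 3986 unreserved characters untouched; the sample string is made up.

```python
from urllib.parse import quote

# safe="" means no reserved character is exempt; only unreserved
# characters (letters, digits, "-", ".", "_", "~") pass through.
print(quote("naïve/path?", safe=""))  # na%C3%AFve%2Fpath%3F
print(quote("A-Z_a.z~09", safe=""))   # A-Z_a.z~09
```

Leaving safe at its default of "/" instead keeps slashes literal, which suits whole paths rather than single segments.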

Not addressed by the current specification is what to do with encoded character data. For example, in computers, character data manifests in encoded form, at some level, and thus could be treated as either binary or character data when being mapped to URI characters. Presumably, it is up to the URI scheme specifications to account for this possibility and require one or the other, but in practice, few, if any, actually do.

Non-standard implementations

There exists a non-standard encoding for Unicode characters: %uxxxx, where xxxx is a UTF-16 code unit represented as four hexadecimal digits. This behavior is not specified by any RFC and has been rejected by the W3C. The 13th edition of ECMA-262 still includes an escape function that uses this syntax, along with encodeURI and encodeURIComponent functions, which apply UTF-8 encoding to a string and then percent-escape the resulting bytes.[2]
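For comparison with the UTF-8 approach, the %uxxxx form can be approximated in Python. This is a rough sketch modeled on the behavior of the ECMAScript escape function, not a reference implementation: escape_u is a hypothetical name, and the set of characters left unescaped is an assumption taken from that function's definition.

```python
import string

# Characters that the ECMAScript escape function leaves unescaped (assumed set).
_SAFE = set(string.ascii_letters + string.digits + "@*_+-./")


def escape_u(text: str) -> str:
    """Encode text using %XX and the non-standard %uXXXX form (hypothetical helper)."""
    out = []
    data = text.encode("utf-16-be")
    # Walk the string one 16-bit UTF-16 code unit at a time.
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "big")
        ch = chr(unit)
        if ch in _SAFE:
            out.append(ch)
        elif unit < 256:
            out.append(f"%{unit:02X}")
        else:
            out.append(f"%u{unit:04X}")
    return "".join(out)


print(escape_u("a b"))  # a%20b
print(escape_u("€"))    # %u20AC
print(escape_u("😀"))   # %uD83D%uDE00 (a surrogate pair, one escape per code unit)
```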

The application/x-www-form-urlencoded type

When data that has been entered into HTML forms is submitted, the form field names and values are encoded and sent to the server in an HTTP request message using method GET or POST, or, historically, via email.[3] The encoding used by default is based on an early version of the general URI percent-encoding rules,[4] with a number of modifications such as newline normalization and replacing spaces with + instead of %20. The media type of data encoded this way is application/x-www-form-urlencoded, and it is currently defined in the HTML and XForms specifications. In addition, the CGI specification contains rules for how web servers decode data of this type and make it available to applications.

When HTML form data is sent in an HTTP GET request, it is included in the query component of the request URI using the same syntax described above. When sent in an HTTP POST request or via email, the data is placed in the body of the message, and application/x-www-form-urlencoded is included in the message's Content-Type header.
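The difference between form encoding and plain percent-encoding can be seen with Python's urllib.parse module. The field names and values below are invented for illustration, and the sketch does not cover the newline normalization mentioned above.

```python
from urllib.parse import parse_qs, quote, urlencode

fields = {"name": "Jane Doe", "q": "a&b=c"}

body = urlencode(fields)
print(body)               # name=Jane+Doe&q=a%26b%3Dc
print(quote("Jane Doe"))  # Jane%20Doe  (plain URI percent-encoding uses %20)

# Decoding reverses both the "+" substitution and the percent-escapes.
print(parse_qs(body))     # {'name': ['Jane Doe'], 'q': ['a&b=c']}
```

The same string could be appended to a URI after ? for a GET request or sent as the body of a POST request.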

References

[1] RFC 1738 §2.2; RFC 2396 §2.4; RFC 3986 §1.2.1, 2.1, 2.5.
[2] "ECMAScript 2017 Language Specification (ECMA-262, 8th edition, June 2017)". Ecma International. Archived from the original on 2018-07-02. Retrieved 2018-06-20.
[3] User-agent support for email-based HTML form submission, using a 'mailto' URL as the form action, was proposed in RFC 1867 section 5.6, during the HTML 3.2 era. Various web browsers implemented it by invoking a separate email program or using their own rudimentary SMTP capabilities. Although sometimes unreliable, it was briefly popular as a simple way to transmit form data without involving a web server or CGI scripts.
[4] Berners-Lee, T. (June 1994). "RFC 1630". IETF Tools. IETF. Archived from the original on 21 June 2016. Retrieved 29 June 2016.

External links

The following specifications all discuss and define reserved characters, unreserved characters, and percent-encoding, in some form or other:

RFC 3986 / STD 66 (plus errata), the current generic URI syntax specification.
RFC 2396 (obsolete, plus errata) and RFC 2732 (plus errata) together comprised the previous version of the generic URI syntax specification.
RFC 1738 (mostly obsolete) and RFC 1808 (obsolete), which define URLs.
RFC 1630 (obsolete), the first generic URI syntax specification.
W3C Guidelines on Naming and Addressing: URIs, URLs,...
W3C explanation of UTF-8 in URIs
W3C HTML form content types