The byte-order mark (BOM) is a particular usage of the special Unicode character code, U+FEFF ZERO WIDTH NO-BREAK SPACE, whose appearance as a magic number at the start of a text stream can signal several things to a program reading the text:[1]
- the byte order, or endianness, of the text stream in the cases of 16-bit and 32-bit encodings;
- the fact that the text stream's encoding is Unicode, to a high level of confidence;
- which Unicode character encoding is used.
BOM use is optional. Its presence interferes with the use of UTF-8 by software that does not expect non-ASCII bytes at the start of a file but that could otherwise handle the text stream.
Unicode can be encoded in units of 8-bit, 16-bit, or 32-bit integers. For the 16- and 32-bit representations, a computer receiving text from arbitrary sources needs to know which byte order the integers are encoded in. The BOM is encoded in the same scheme as the rest of the document and becomes a noncharacter Unicode code point if its bytes are swapped. Hence, the process accessing the text can examine these first few bytes to determine the endianness, without requiring some contract or metadata outside of the text stream itself. Generally the receiving computer swaps the bytes to its own endianness, if necessary, and then no longer needs the BOM for processing.
The byte sequence of the BOM differs per Unicode encoding (including ones outside the Unicode standard such as UTF-7, see table below), and none of the sequences is likely to appear at the start of text streams stored in other encodings. Therefore, placing an encoded BOM at the start of a text stream can indicate that the text is Unicode and identify the encoding scheme used. This use of the BOM is called a "Unicode signature".[2]
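As a rough illustration of using the BOM as a signature, the following Python sketch checks the first bytes of a stream against the BOM constants in the standard `codecs` module. The function name `detect_bom` and its return convention are assumptions made for this example, and only the UTF-8, UTF-16 and UTF-32 signatures are covered.

```python
import codecs

# Candidate signatures, longest first, so that a 4-byte UTF-32 BOM is
# tested before the 2-byte UTF-16 BOM that shares its leading bytes.
_SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_bom(data: bytes):
    """Return (encoding, bom_length) for a leading BOM, or (None, 0).

    Hypothetical helper: it only recognises the five signatures above.
    """
    for bom, encoding in _SIGNATURES:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0
```

A caller could then decode the remainder with `data[length:].decode(encoding)`; when no signature is found, some other source of encoding information is still needed.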
Usage
The BOM is simply the Unicode code point U+FEFF ZERO WIDTH NO-BREAK SPACE, encoded in the current encoding. A text file beginning with the bytes FE FF suggests that the file is encoded in big-endian UTF-16.
The name ZWNBSP should be used if the BOM appears in the middle of a data stream. Unicode says it should be interpreted as a normal code point (namely a word joiner), not as a BOM. Since Unicode 3.2, this usage has been deprecated in favor of U+2060 WORD JOINER.[1]
The Unicode 1.0 name for this code point is also BYTE ORDER MARK.[3]
UTF-8
The UTF-8 representation of the BOM is the (hexadecimal) byte sequence EF BB BF.
The Unicode Standard permits the BOM in UTF-8,[4] but does not require or recommend its use.[5] UTF-8 always has the same byte order,[6] so the BOM's only use in UTF-8 is to signal at the start that the text stream is encoded in UTF-8, or that it was converted to UTF-8 from a stream that contained an optional BOM. The standard also does not recommend removing a BOM when it is there, so that round-tripping between encodings does not lose information, and so that code that relies on it continues to work.[7][8] The IETF recommends that if a protocol either (a) always uses UTF-8, or (b) has some other way to indicate what encoding is being used, then it "SHOULD forbid use of U+FEFF as a signature."[9] An example of not following this recommendation is the IETF Syslog protocol, which requires text to be in UTF-8 and also requires the BOM.[10]
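How a leading UTF-8 BOM is handled depends on the decoder. As a small sketch using Python's built-in codecs: the plain `utf-8` codec leaves the BOM in the decoded text as U+FEFF, while the `utf-8-sig` codec removes a leading BOM on decoding and writes one on encoding. The sample text and variable names are chosen only for this example.

```python
import codecs

text = "h\u00e9llo"                      # sample text with one non-ASCII character
raw = codecs.BOM_UTF8 + text.encode("utf-8")

# Plain UTF-8 decoding keeps the BOM as the character U+FEFF.
kept = raw.decode("utf-8")

# "utf-8-sig" drops a leading BOM if one is present, and is a no-op otherwise.
stripped = raw.decode("utf-8-sig")
unchanged = text.encode("utf-8").decode("utf-8-sig")

# Encoding with "utf-8-sig" prepends the BOM; plain "utf-8" never does.
with_signature = text.encode("utf-8-sig")
```

Whether to strip or preserve the BOM is a policy decision for the application; the sketch only shows that both behaviours are available.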
Not using a BOM allows text to be backwards-compatible with software designed for extended ASCII. For instance, many programming languages permit non-ASCII bytes in string literals but not at the start of the file.
A BOM is unnecessary for detecting UTF-8 encoding.[citation needed] UTF-8 is a sparse encoding: a large fraction of possible byte combinations do not result in valid UTF-8 text. Binary data and text in any other encoding are likely to contain byte sequences that are invalid as UTF-8, so the existence of such invalid sequences indicates the file is not UTF-8, while the lack of invalid sequences is a very strong indication that the text is UTF-8. Practically the only exception is text containing only ASCII-range bytes, as this may be a non-ASCII 7-bit encoding, but this is unlikely in any modern data and even then the difference from ASCII is minor (such as changing '\' to '¥').
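The validity check described above can be sketched in Python by attempting a strict UTF-8 decode. The function name `classify_bytes` and its three result labels are assumptions for this illustration; real detectors usually combine this check with other evidence.

```python
def classify_bytes(data: bytes) -> str:
    """Classify data as 'ascii', 'utf-8', or 'not utf-8' (hypothetical labels).

    Pure-ASCII input is reported separately because it is valid in UTF-8
    and in many other encodings, so it says little about the encoding.
    """
    if all(byte < 0x80 for byte in data):
        return "ascii"
    try:
        data.decode("utf-8")  # strict decoding raises on any invalid sequence
    except UnicodeDecodeError:
        return "not utf-8"
    return "utf-8"
```

For example, the word "naïve" encoded as UTF-8 passes the check, while the same word encoded as Latin-1 contains a lone 0xEF byte followed by an ASCII letter, which is not a valid UTF-8 sequence.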
Microsoft compilers[11] and interpreters, and many pieces of software on Microsoft Windows such as Notepad (prior to Windows 10 Build 1903[12]), treat the BOM as a required magic number rather than use heuristics. These tools add a BOM when saving text as UTF-8, and cannot interpret UTF-8 unless the BOM is present or the file contains only ASCII. Windows PowerShell (up to 5.1) adds a BOM when it saves UTF-8 XML documents. However, PowerShell Core 6 added a utf8NoBOM value for the -Encoding switch on some cmdlets so that documents can be saved without a BOM. Google Docs also adds a BOM when converting a document to a plain text file for download.
UTF-16
In UTF-16, a BOM (U+FEFF) may be placed as the first bytes of a file or character stream to indicate the endianness (byte order) of all the 16-bit code units of the file or stream. If an attempt is made to read this stream with the wrong endianness, the bytes will be swapped, thus delivering the character U+FFFE, which is defined by Unicode as a "noncharacter" that should never appear in the text.
- If the 16-bit units are represented in big-endian byte order ("UTF-16BE"), the BOM is the (hexadecimal) byte sequence FE FF.
- If the 16-bit units use little-endian order ("UTF-16LE"), the BOM is the (hexadecimal) byte sequence FF FE.
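The effect of reading the BOM with the wrong endianness can be shown with Python's `struct` module, which packs and unpacks 16-bit integers in an explicit byte order. The helper `utf16_byte_order` is a hypothetical name for this sketch.

```python
import struct

BOM = 0xFEFF

big = struct.pack(">H", BOM)      # big-endian bytes: FE FF
little = struct.pack("<H", BOM)   # little-endian bytes: FF FE

# Reading the big-endian BOM as little-endian swaps the bytes and
# yields the noncharacter U+FFFE instead of U+FEFF.
misread = struct.unpack("<H", big)[0]


def utf16_byte_order(first_two: bytes):
    """Return 'big', 'little', or None from the first two bytes of a stream."""
    (value,) = struct.unpack(">H", first_two)
    if value == 0xFEFF:
        return "big"
    if value == 0xFFFE:   # the BOM seen with its bytes swapped
        return "little"
    return None
```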
For the IANA registered charsets UTF-16BE and UTF-16LE, a byte-order mark should not be used because the names of these character sets already determine the byte order.
If there is no BOM, it is possible to guess whether the text is UTF-16 and its byte order by searching for ASCII characters (i.e. a 0 byte adjacent to a byte in the 0x20-0x7E range, or to 0x0A or 0x0D for LF and CR). A large number of such pairs (i.e. far more than random chance would produce), all with the 0 byte in the same position, is a very good indication of UTF-16, and whether the 0 is in the even or odd bytes indicates the byte order. However, this can result in both false positives and false negatives.
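A minimal sketch of this heuristic in Python follows. The function name `guess_utf16_byte_order` and the default `threshold` of 0.5 (the fraction of 16-bit units that must look like ASCII characters) are assumptions for illustration, not values taken from any standard.

```python
def _is_ascii_text(byte: int) -> bool:
    # Printable ASCII plus LF (0x0A) and CR (0x0D).
    return 0x20 <= byte <= 0x7E or byte in (0x0A, 0x0D)


def guess_utf16_byte_order(data: bytes, threshold: float = 0.5):
    """Guess 'utf-16-le' or 'utf-16-be' from BOM-less data, or return None.

    Counts 16-bit units that consist of one ASCII text byte and one zero
    byte, and looks at which position the zero byte occupies.
    """
    units = len(data) // 2
    if units == 0:
        return None
    little = big = 0
    for i in range(0, units * 2, 2):
        first, second = data[i], data[i + 1]
        if second == 0 and _is_ascii_text(first):
            little += 1   # ASCII byte first, zero byte second
        elif first == 0 and _is_ascii_text(second):
            big += 1      # zero byte first, ASCII byte second
    if little > big and little / units >= threshold:
        return "utf-16-le"
    if big > little and big / units >= threshold:
        return "utf-16-be"
    return None
```

As the text notes, such a guess can be wrong in both directions: text with few ASCII characters falls below the threshold, and binary data with many zero bytes may exceed it.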
Clause D98 of conformance (section 3.10) of the Unicode standard states, "The UTF-16 encoding scheme may or may not begin with a BOM. However, when there is no BOM, and in the absence of a higher-level protocol, the byte order of the UTF-16 encoding scheme is big-endian." Whether or not a higher-level protocol is in force is open to interpretation. Files local to a computer for which the native byte ordering is little-endian, for example, might be argued to be encoded as UTF-16LE implicitly. Therefore, the presumption of big-endian is widely ignored. The W3C/WHATWG encoding standard used in HTML5 specifies that content labelled either "utf-16" or "utf-16le" is to be interpreted as little-endian "to deal with deployed content".[13] However, if a byte-order mark is present, then that BOM is to be treated as "more authoritative than anything else".[14]
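The two policies described here differ only in the fallback used when no BOM is present. A Python sketch, assuming a hypothetical helper `decode_utf16` whose `default` parameter selects the fallback byte order:

```python
import codecs


def decode_utf16(data: bytes, default: str = "utf-16-be") -> str:
    """Decode UTF-16 data, treating a leading BOM as authoritative.

    Without a BOM the `default` codec is used: "utf-16-be" follows the
    big-endian presumption of clause D98, while "utf-16-le" mirrors the
    little-endian choice described for the WHATWG Encoding Standard.
    """
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    return data.decode(default)
```

For example, `decode_utf16(b"A\x00", default="utf-16-le")` returns "A", while the same bytes decoded with the big-endian default would give U+4100 instead.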
UTF-32
Although a BOM could be used with UTF-32, this encoding is rarely used for transmission. Otherwise the same rules as for UTF-16 are applicable.
The BOM for little-endian UTF-32 is the same pattern as a little-endian UTF-16 BOM followed by a UTF-16 NUL character, an unusual example of the BOM being the same pattern in two different encodings. Programmers using the BOM to identify the encoding will have to decide whether UTF-32 or UTF-16 with a NUL first character is more likely.
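The ambiguity can be seen by decoding the same bytes both ways with Python's built-in `utf-32` and `utf-16` codecs, each of which consumes a leading BOM. The sample bytes are chosen only for this illustration.

```python
# FF FE 00 00 followed by the letter "A" as a little-endian 32-bit unit.
data = b"\xff\xfe\x00\x00" + b"A\x00\x00\x00"

# Read as UTF-32: the first four bytes are the BOM, leaving just "A".
as_utf32 = data.decode("utf-32")

# Read as UTF-16: the first two bytes are the BOM, and the next two are
# a NUL character, so the text becomes NUL, "A", NUL.
as_utf16 = data.decode("utf-16")
```

Both decodes succeed, so a program has to pick one. Preferring UTF-32 rests on the judgment that text rarely begins with a NUL character.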
Byte-order marks by encoding
This table illustrates how the BOM is represented as a byte sequence in various encodings and how those sequences might appear in a text editor that is interpreting each byte as a legacy encoding (Windows-1252 and caret notation for the C0 controls):
Encoding | Representation (hexadecimal) | Representation (decimal) | Bytes interpreted as Windows-1252 |
---|---|---|---|
UTF-8[a] | EF BB BF | 239 187 191 | ï»¿ |
UTF-16 (BE) | FE FF | 254 255 | þÿ |
UTF-16 (LE) | FF FE | 255 254 | ÿþ |
UTF-32 (BE) | 00 00 FE FF | 0 0 254 255 | ^@^@þÿ (^@ is the null character) |
UTF-32 (LE) | FF FE 00 00 | 255 254 0 0 | ÿþ^@^@ (^@ is the null character) |
UTF-7[a] | 2B 2F 76[b][16][17] | 43 47 118 | +/v |
UTF-1[a] | F7 64 4C | 247 100 76 | ÷dL |
UTF-EBCDIC[a] | DD 73 66 73 | 221 115 102 115 | Ýsfs |
SCSU[a] | 0E FE FF[c] | 14 254 255 | ^Nþÿ (^N is the "shift out" character) |
BOCU-1[a] | FB EE 28 | 251 238 40 | ûî( |
GB18030[a] | 84 31 95 33 | 132 49 149 51 | „1•3 |
- [a] This is not literally a "byte order" mark, since a code unit in these encodings is one byte and therefore cannot have bytes in a "wrong" order. Nevertheless, the BOM can be used to indicate the encoding of the text that follows it.[6][15]
- [b] Followed by 38, 39, 2B, or 2F (ASCII 8, 9, + or /), depending on what the next character is.
- [c] SCSU allows other encodings of U+FEFF; the shown form is the signature recommended in UTR #6.[18]
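Several rows of the table can be reproduced by encoding U+FEFF with Python's built-in codecs. This is a sketch only: `bom_row` is a hypothetical helper, and encodings without a standard Python codec (such as UTF-1, UTF-EBCDIC, SCSU and BOCU-1) are not covered.

```python
def bom_row(encoding: str):
    """Return (hexadecimal, decimal) strings for U+FEFF in the given encoding."""
    raw = "\ufeff".encode(encoding)
    hex_form = " ".join(f"{byte:02X}" for byte in raw)
    dec_form = " ".join(str(byte) for byte in raw)
    return hex_form, dec_form


if __name__ == "__main__":
    # The explicit -be/-le codec names are used because they encode the
    # character itself without prepending an extra BOM.
    for name in ("utf-8", "utf-16-be", "utf-16-le",
                 "utf-32-be", "utf-32-le", "gb18030"):
        hex_form, dec_form = bom_row(name)
        print(f"{name} | {hex_form} | {dec_form}")
```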
See also
- Left-to-right mark
- Arabic Presentation Forms-B, block to which code point U+FEFF belongs
References
- [1] "FAQ - UTF-8, UTF-16, UTF-32 & BOM". Unicode.org. Retrieved 28 January 2017.
- [2] "The Unicode® Standard Version 9.0" (PDF). The Unicode Consortium.
- [3] "Zero Width No-Break Space (U+Feff)".
- [4] "The Unicode Standard 5.0, Chapter 2: General Structure" (PDF). p. 36. Retrieved 29 March 2009. Table 2-4. The Seven Unicode Encoding Schemes
- [5] "The Unicode Standard 5.0, Chapter 2: General Structure" (PDF). p. 36. Retrieved 30 November 2008. "Use of a BOM is neither required nor recommended for UTF-8, but may be encountered in contexts where UTF-8 data is converted from other encoding forms that use a BOM or where the BOM is used as a UTF-8 signature"
- [6] "FAQ - UTF-8, UTF-16, UTF-32 & BOM: Can a UTF-8 data stream contain the BOM character (in UTF-8 form)? If yes, then can I still assume the remaining UTF-8 bytes are in big-endian order?". Unicode.org. Retrieved 4 January 2009.
- [7] "Re: pre-HTML5 and the BOM from Asmus Freytag on 2012-07-13 (Unicode Mail List Archive)". Unicode.org. Retrieved 14 July 2012.
- [8] "Bug ID: JDK-6378911 UTF-8 decoder handling of byte-order mark has changed". Bugs.java. Retrieved 14 October 2021.
- [9] Yergeau, Francois (November 2003). UTF-8, a transformation format of ISO 10646. IETF. doi:10.17487/RFC3629. RFC 3629. Retrieved 15 May 2014.
- [10] Gerhards, Rainer (March 2009). "MSG". The Syslog Protocol. IETF. sec. 6.4. doi:10.17487/RFC5424. RFC 5424.
- [11] Alf P. Steinbach (2011). "Unicode part 1: Windows console i/o approaches". Retrieved 24 March 2012. "However, since the C++ source code was encoded as UTF-8 without BOM (as is usual in Linux), the Visual C++ compiler erroneously assumed that the source code was encoded as Windows ANSI."
- [12] "Windows 10 Notepad is Getting Better UTF-8 Encoding Support". BleepingComputer. Retrieved 7 March 2023.
- [13] "UTF-16LE". Encoding Standard. WHATWG.
- [14] "Decode". Encoding Standard. WHATWG.
- [15] Yergeau, François (8 November 2003). "RFC 3629 - UTF-8, a transformation format of ISO 10646". IETF Datatracker. Retrieved 28 January 2017.
- [16] Honermann, Tom (2 January 2021). "Clarify guidance for use of a BOM as a UTF-8 encoding signature" (PDF). Unicode.
- [17] "SDL Documentation".
- [18] Markus Scherer. "UTS #6: Compression Scheme for Unicode". Unicode.org. Retrieved 28 January 2017.