Windows-1252
MIME / IANA | windows-1252[1] |
---|---|
Alias(es) | cp1252 (code page 1252) |
Language(s) | All supported by ISO/IEC 8859-1, plus full support for French and Finnish and ligature forms for English; e.g. Danish (except for a rare exceptional letter), Irish, Italian, Norwegian, Portuguese, Spanish, Swedish, German (missing uppercase ẞ), Icelandic, Faroese, Luxembourgish, Albanian, Estonian, Swahili, Tswana, Catalan, Basque, Occitan, Rotokas, Toki Pona, Lojban, Romansh, Dutch (except the IJ/ij character, substituted by IJ/ij or ÿ), and Slovene (except the č character, substituted by ç). |
Created by | Microsoft |
Standard | WHATWG Encoding Standard |
Classification | extended ASCII, Windows-125x |
Extends | ISO 8859-1 (excluding C1 controls) |
Transforms / Encodes | ISO 8859-15 |
Succeeded by | Unicode (UTF-8, UTF-16) |
Windows-1252 or CP-1252 (Windows code page 1252) is a legacy single-byte character encoding[2] that is used by default (as the "ANSI code page") in Microsoft Windows throughout the Americas, Western Europe, Oceania, and much of Africa.[3]
Initially the same as ISO 8859-1, it began to diverge starting in Windows 2.0 by adding characters in the 0x80 to 0x9F (hex) range (the ISO standards reserve this range for C1 control codes). Notable additional characters include curly quotation marks and all printable characters from ISO 8859-15.
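The divergence in the 0x80 to 0x9F range can be seen by decoding the same bytes under both encodings. The following is a minimal sketch in Python, assuming the standard-library codec names `cp1252`, `latin-1`, and `iso8859_15` (these names come from Python, not from the sources cited in this article):

```python
# Bytes 0x93 and 0x94 are curly quotation marks in Windows-1252
# but C1 control codes in ISO 8859-1.
data = b"\x93quoted\x94"

as_cp1252 = data.decode("cp1252")
as_latin1 = data.decode("latin-1")

print(repr(as_cp1252))  # curly quotes U+201C and U+201D around the word
print(repr(as_latin1))  # C1 controls U+0093 and U+0094 around the word

# Every printable ISO 8859-15 character (bytes 0xA0 to 0xFF) can be
# encoded in Windows-1252; encode() would raise UnicodeEncodeError otherwise.
iso_8859_15_printable = bytes(range(0xA0, 0x100)).decode("iso8859_15")
encoded = iso_8859_15_printable.encode("cp1252")
```

The same byte values therefore produce visible punctuation under one label and invisible control codes under the other, which is why mislabeled text is a practical problem.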
It is the most-used single-byte character encoding in the world. Although almost all websites now use the multi-byte character encoding UTF-8, as of July 2024 1.2%[4] of websites declared ISO 8859-1, which is treated as Windows-1252 by all modern browsers (as demanded by the HTML5 standard[5]), and a further 0.3% declared Windows-1252 directly,[4][6] for a total of 1.5%. Some countries or languages show higher usage than the global average: in 2024, usage among websites in Brazil was 3.4%,[7] and in Germany 2.7%[8][9] (these figures are the sums of ISO-8859-1 and CP-1252 declarations).
Name
It is known to Windows by the code page number 1252, and by the IANA-approved name "windows-1252".
Historically, the phrase "ANSI Code Page" was used in Windows to refer to non-DOS encodings; the intention was that most of these would be ANSI standards such as ISO-8859-1. Even though Windows-1252 was the first and by far the most popular code page named so in Microsoft Windows parlance, the code page has never been an ANSI standard. Microsoft explains, "The term ANSI as used to signify Windows code pages is a historical reference, but is nowadays a misnomer that continues to persist in the Windows community."[10]
LaTeX can input Windows-1252 by using inputenc.sty with the parameter ansinew (and, more recently, cp1252).[11][12]
IBM uses code page 1252 (CCSID 1252 and euro sign extended CCSID 5348) for Windows-1252.[13][14][15]
It is called "WE8MSWIN1252" by Oracle Database.[16]
History
- The first version of the codepage was used in Microsoft Windows 1.0. It matched the ISO-8859-1 standard (including leaving code points 0xD7 and 0xF7 undefined, as they were not in the standard at that time).
- The second version of the codepage was introduced in Microsoft Windows 2.0. In this version, code points 0xD7, 0xF7, 0x91, and 0x92 are defined.
- The third version of the codepage was introduced in Microsoft Windows 3.1. It defined all code points used in the final version except the euro sign and the Z with caron character pair.
- The final version (shown below) was introduced in Microsoft Windows 98.
Starting in the 1990s, many Microsoft products that could produce HTML included Windows-1252-exclusive characters, but marked the encoding as ISO-8859-1, ASCII, or undeclared.[citation needed] Characters exclusive to Windows-1252 would render incorrectly on non-Windows operating systems (often as question marks).[17][18] In particular, typographers' quotes (curly variants of the standard straight apostrophes and quotation marks in US-ASCII) were commonly used in files produced in Windows applications such as Microsoft Word due to the smart quotes feature, which can automatically convert straight apostrophes and quotation marks to the curly variants.[19] To fix this, by 2000 most web browsers and e-mail clients treated the charsets ISO-8859-1 and US-ASCII as Windows-1252;[citation needed] this behavior is now required by the HTML5 specification.[5] Undeclared charsets in HTML are also assumed to be Windows-1252.[20][21]
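The label handling described above can be approximated in a few lines. This is a minimal sketch, not a browser's actual implementation: the function name `resolve_label` is hypothetical, the label set is only a partial list of the labels that the WHATWG Encoding Standard assigns to windows-1252, and treating an undeclared charset as Windows-1252 follows the statement above rather than any particular browser.

```python
# Partial set of labels treated as windows-1252 (a subset, not the full list).
WINDOWS_1252_LABELS = {
    "ascii", "us-ascii", "iso-8859-1", "iso8859-1", "iso_8859-1",
    "latin1", "l1", "cp1252", "windows-1252", "x-cp1252",
}

def resolve_label(label):
    """Return the Python codec name to use for a declared charset label.

    Hypothetical helper: an undeclared (None) or listed label resolves to
    cp1252; anything else is returned unchanged, lowercased and stripped.
    """
    if label is None:
        return "cp1252"
    normalized = label.strip().lower()
    if normalized in WINDOWS_1252_LABELS:
        return "cp1252"
    return normalized

# A page declared as ISO-8859-1 but containing Word "smart quotes".
body = b"\x93smart quotes\x94"
text = body.decode(resolve_label("ISO-8859-1"))
```

Decoding with the resolved codec yields the curly quotation marks the author intended instead of C1 control codes.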
Although Windows NT supported Unicode and attempted to encourage programs to use it, it only provided the 16-bit code units of UCS-2/UTF-16, despite the existing support for other multibyte character encodings. As many applications preferred to use 8-bit strings, Windows-1252 remained the most popular encoding on Windows even after it added support for UTF-16. Unicode support in Windows has improved over time, with UTF-8 support available starting in Windows 10.
Codepage layout
The following table shows Windows-1252. Differences from ISO-8859-1 have the Unicode code point number after the character, based on the Unicode.org mapping of Windows-1252 with "best fit".
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | |
0_ | NUL | SOH | STX | ETX | EOT | ENQ | ACK | BEL | BS | HT | LF | VT | FF | CR | SO | SI |
1_ | DLE | DC1 | DC2 | DC3 | DC4 | NAK | SYN | ETB | CAN | EM | SUB | ESC | FS | GS | RS | US |
2_ | SP | ! | " | # | $ | % | & | ' | ( | ) | * | + | , | - | . | / |
3_ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | : | ; | < | = | > | ? |
4_ | @ | A | B | C | D | E | F | G | H | I | J | K | L | M | N | O |
5_ | P | Q | R | S | T | U | V | W | X | Y | Z | [ | \ | ] | ^ | _ |
6_ | ` | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o |
7_ | p | q | r | s | t | u | v | w | x | y | z | { | | | } | ~ | DEL |
8_ | € 20AC | | ‚ 201A | ƒ 0192 | „ 201E | … 2026 | † 2020 | ‡ 2021 | ˆ 02C6 | ‰ 2030 | Š 0160 | ‹ 2039 | Œ 0152 | | Ž 017D | |
9_ | | ‘ 2018 | ’ 2019 | “ 201C | ” 201D | • 2022 | – 2013 | — 2014 | ˜ 02DC | ™ 2122 | š 0161 | › 203A | œ 0153 | | ž 017E | Ÿ 0178 |
A_ | NBSP | ¡ | ¢ | £ | ¤ | ¥ | ¦ | § | ¨ | © | ª | « | ¬ | SHY | ® | ¯ |
B_ | ° | ± | ² | ³ | ´ | µ | ¶ | · | ¸ | ¹ | º | » | ¼ | ½ | ¾ | ¿ |
C_ | À | Á | Â | Ã | Ä | Å | Æ | Ç | È | É | Ê | Ë | Ì | Í | Î | Ï |
D_ | Ð | Ñ | Ò | Ó | Ô | Õ | Ö | × | Ø | Ù | Ú | Û | Ü | Ý | Þ | ß |
E_ | à | á | â | ã | ä | å | æ | ç | è | é | ê | ë | ì | í | î | ï |
F_ | ð | ñ | ò | ó | ô | õ | ö | ÷ | ø | ù | ú | û | ü | ý | þ | ÿ |
According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused; however, the Windows API MultiByteToWideChar maps these to the corresponding C1 control codes. The "best fit" mapping documents this behavior, too.[22]
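The difference between the strict mapping and the "best fit" behavior can be sketched as follows. This is an illustration only: it assumes Python's built-in `cp1252` codec, which treats the five unused positions as undefined, and the helper name `decode_cp1252_best_fit` is hypothetical. It does not call `MultiByteToWideChar`.

```python
# Positions that the published Windows-1252 tables leave unused.
UNUSED_POSITIONS = frozenset({0x81, 0x8D, 0x8F, 0x90, 0x9D})

def decode_cp1252_best_fit(data: bytes) -> str:
    """Decode Windows-1252, mapping unused bytes to C1 control codes."""
    chars = []
    for byte in data:
        if byte in UNUSED_POSITIONS:
            # Same numeric value, e.g. 0x81 becomes U+0081.
            chars.append(chr(byte))
        else:
            chars.append(bytes([byte]).decode("cp1252"))
    return "".join(chars)

sample = b"\x80\x81\x9d"

# The strict codec rejects the unused positions.
try:
    sample.decode("cp1252")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The lenient helper keeps them as C1 controls.
lenient = decode_cp1252_best_fit(sample)
```

A lenient decoder of this kind never fails on arbitrary bytes, which suits round-tripping data of unknown origin; the strict form is better for detecting text that is not Windows-1252 at all.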
Related encodings
OS/2 extensions
The OS/2 operating system supports an encoding by the name of Code page 1004 (CCSID 1004) or "Windows Extended".[27][28] This mostly matches code page 1252, with the exception of certain C0 control characters being replaced by diacritic characters.
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | |
0_ | NUL | SOH | STX | ETX | ˉ 02C9 | ˘ 02D8 | ˙ 02D9 | BEL | ˚ 02DA | HT | ˝ 02DD | ˛ 02DB | ˇ 02C7 | CR | SO | SI |
MS-DOS extensions (rare)
There is a rarely used, but useful, graphics-extended variant of code page 1252 in which codes 0x00 to 0x1F allow for box drawing, as used in applications such as MS-DOS Edit and CodeView. One of the applications to use this code page was an Intel Corporation Install/Recovery disk image utility from mid/late 1995. These programs were written for its P6 User Test Program machines (US example[33]). It was used exclusively in its then EMEA region (Europe, Middle East & Africa). In time the programs were changed to use code page 850.
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | |
0_ | ○ | ■ | ↑ | ↓ | → | ← | ║ | ═ | ╔ | ╗ | ╚ | ╝ | ░ | ▒ | ► | ◄ |
1_ | │ | ─ | ┌ | ┐ | └ | ┘ | ├ | ┤ | ┴ | ┬ | ♦ | ┼ | █ | ▄ | ▀ | ▬ |
Palm OS variant
Each Palm OS device supports a single language and a single character encoding, depending on its locale.[34]
For languages such as English and French, Palm OS uses a custom character encoding based on Windows-1252. For Japanese, it instead uses a multibyte character encoding based on code page 932. Regardless of the system locale, all characters in the range 0x00 to 0x7F are guaranteed to be the same, except 0x5C, which is the Yen sign in Japanese and a backslash in all others.[34]
Palm OS 3.1 introduced several changes to the character encoding to better align with Windows-1252:[35]
- The special Palm OS glyphs "shortcut stroke" (0x9D) and "command stroke" (0x9E) were copied to 0x16 and 0x17, to ensure they were in the range guaranteed to be consistent between locales.[35] Starting in Palm OS 3.3, 0x16 and 0x17 are the only code points for those characters,[36] leaving 0x9D and 0x9E undefined.[37]
- The numeric space (0x80) and horizontal ellipsis (0x85) were copied to 0x19 and 0x18 (respectively), to ensure they were in the range guaranteed to be consistent between locales.[35][36]
- The Euro sign was added at 0x80, replacing what was previously the numeric space.[36]
- The playing card suits were copied to the font Symbol 9,[35]although their original code points remain valid.[36][37]
The following is the variant of Windows-1252 used by Palm OS 3.3 onward for English and several other locales.[36] Python gives it the palmos label, describing it as the encoding for Palm OS 3.5.[38][39] Differences from Windows-1252 are shown with their Unicode code point.
0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | A | B | C | D | E | F | |
8_ | €[a] | | ‚ | ƒ | „ | …[b] | † | ‡ | ˆ | ‰ | Š | ‹ | Œ | ♦ 2666 | ♣ 2663 | ♥ 2665 |
9_ | ♠ 2660 | ‘ | ’ | “ | ” | • | – | — | ˜ | ™ | š | › | œ | [c] | [d] | Ÿ |
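The differences in the table can be checked against the `palmos` codec that Python ships. The following is a minimal sketch, assuming the codec follows the Palm OS 3.5 mapping described in its source; the variable names are illustrative only.

```python
# Bytes 0x8D to 0x90 are the playing card suits in the Palm OS variant.
suits = b"\x8d\x8e\x8f\x90".decode("palmos")

# List every byte in 0x80-0x9F where palmos and cp1252 disagree.
# errors="replace" turns positions undefined in cp1252 into U+FFFD.
differences = []
for byte in range(0x80, 0xA0):
    raw = bytes([byte])
    palm = raw.decode("palmos", errors="replace")
    win = raw.decode("cp1252", errors="replace")
    if palm != win:
        differences.append((hex(byte), palm, win))

print(suits)
for entry in differences:
    print(entry)
```

Because Python's `cp1252` codec leaves the unused Windows-1252 positions undefined, those positions appear in the list of differences whenever the Palm OS codec assigns them any character.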
See also
- Latin script in Unicode
- Unicode
- Universal Coded Character Set
- UTF-8
- Western Latin character sets (computing)
- Windows-1250
- Windows code pages
- ISO/IEC JTC 1/SC 2
- Extended ASCII
Notes
- Prior to Palm OS 3.1, the character at code point 0x80 was U+2007 NUMERIC SPACE; starting in Palm OS 3.1, 0x80 is the Euro sign and 0x19 is U+2007 NUMERIC SPACE instead.[36]
- Starting in Palm OS 3.1, this character is also duplicated at 0x18.[35][36]
- Prior to Palm OS 3.3, this code point was the Palm OS-exclusive character "shortcut stroke"; starting in Palm OS 3.3, this code point is undefined.[35][36]
- Prior to Palm OS 3.3, this code point was the Palm OS-exclusive character "command stroke"; starting in Palm OS 3.3, this code point is undefined.[35][36]
References
- Character Sets, Internet Assigned Numbers Authority (IANA), 2018-12-12
- "Encoding. Living Standard". WHATWG. 13 June 2024. § 9. Legacy single-byte encodings. Retrieved 2024-06-28.
- Karl-Bridge-Microsoft (2021-10-26). "Code Pages - Win32 apps". learn.microsoft. Retrieved 2024-10-09.
- "Historical trends in the usage statistics of character encodings for websites, December 2023". w3techs. Retrieved 2024-07-19.
- "Encoding". WHATWG. 27 January 2015. sec. 5.2 Names and labels. Archived from the original on 4 February 2015. Retrieved 4 February 2015.
- "Frequently Asked Questions". w3techs.
- "Distribution of Character Encodings among websites that use Brazil". W3Techs. Archived from the original on 4 April 2024. Retrieved 2024-07-19.
- "Distribution of Character Encodings among websites that use .de". W3Techs. Archived from the original on 4 April 2024. Retrieved 2024-07-19.
- "Distribution of Character Encodings among websites that use German". w3techs. Retrieved 2023-01-16.
- Wissink, Cathy (5 April 2002). "Unicode and Windows XP" (PDF). Microsoft. p. 1. Archived from the original (PDF) on 4 February 2015. Retrieved 4 February 2015.
- "LaTeX News, Issue 28" (PDF; 379 KB). The LaTeX Project. Apr 2018. Retrieved 2024-07-27.
- "Inputenc – Accept different input encodings". The LaTeX Project. 2024-02-08. Retrieved 2024-07-27.
- "Code page 1252 information document". IBM. 30 September 1997. Archived from the original on 2016-03-03.
- "CCSID 1252 information document". IBM. Archived from the original on 2016-03-26.
- "CCSID 5348 information document". IBM. Archived from the original on 2014-11-29.
- "Database Client Installation Guide". Oracle. Retrieved 2021-02-14.
- Texin, Tex. "Comparing Characters in Windows-1252, ISO-8859-1, ISO-8859-15". I18nQA.
- van Emden, Eva (28 January 2011). "How to make typographers' quotes in HTML". vancouvereditor. Retrieved 7 January 2024. "If you use typographers' quotes without specifying the right character encoding for your HTML file, some of your viewers are going to see question marks, boxes, or other crazy symbols instead of the beautiful curly quotes you intended them to see."
- "Smart quotes in Word". Microsoft Support. Microsoft. Retrieved 7 January 2024.
- "NetWare Web Search: Understanding Character Set Encodings". Novell Documentation. Novell. "if a document does not contain a CHARSET encoding value, the default encoding for HTML documents is ISO-8859-1, also known as Latin1. The default encoding for plain text documents is US-ASCII."
- Observed behavior in Chrome, this may be UTF-8 in some browsers.[original research?]
- "Unicode mappings of Windows-1252 with 'Best Fit'". Unicode. Archived from the original on 4 February 2015. Retrieved 4 February 2015.
- Code Page 01252 (PDF), IBM, 1998, archived (PDF) from the original on 27 October 2023
- Code Page (CPGID) 01252 (txt), IBM, 1998, archived from the original on 8 April 2023
- International Components for Unicode (ICU), ibm-1252_P100-2000.ucm, 2002-12-03
- International Components for Unicode (ICU), ibm-5348_P100-1997.ucm, 2002-12-03
- "Code page 1004 information document". Archived from the original on 2015-06-25.
- "CCSID 1004 information document". Archived from the original on 2016-03-26.
- "Code Page 01004" (PDF). IBM. Archived (PDF) from the original on 2015-07-08. (version based on Windows 3.1 version of Windows-1252)
- Code Page CPGID 01004 (pdf) (PDF), IBM
- Code Page CPGID 01004 (txt), IBM
- Borgendale, Ken (2001). "Codepage 1004 - Windows Extended". OS/2 codepages by number. Archived from the original on 2018-05-13. Retrieved 2018-05-13. (version based on current version of Windows-1252)
- Storaasli, Olaf (1996). "Performance of the NASA equation solvers on computational mechanics applications" (PDF). Performance of NASA Equation Solvers on Computational Mechanics Applications. NASA. doi:10.2514/6.1996-1505. S2CID 15711051. Archived from the original (PDF) on 2019-05-03.
- "Chapter 13: Localized Applications". Palm OS Programmer's Companion (PDF). Palm Computing Platform. March 16, 2000. p. 321.
- "Appendix B: Compatibility Guide". Palm OS SDK Reference (PDF). Palm Computing Platform. March 16, 2000. pp. 1181–1182.
- Walleij, Linus. "Palm Pilot Character Sets And Unicode Mappings". GNU Recode. Datorföreningen vid Lunds Universitet och Lunds Tekniska Högskola. Retrieved 10 October 2023.
- Parker, Greg. "Palm OS Built-in Fonts". Sealie Software. Retrieved 10 October 2023.
- "codecs—Codec registry and base classes (§ Text Encodings)". The Python Standard Library—Python 3.9.4 Documentation. Python Software Foundation.
- Mullender, Sjoerd (13 July 2002). "Python Character Mapping Codec for Palm OS 3.5". CPython source tree. Python Software Foundation. Retrieved 9 December 2021.
External links
- Microsoft's code charts for Windows-1252 ("Code Page 1252 Windows Latin 1 (ANSI)")
- Unicode mapping table and code page definition with best fit mappings for Windows-1252