SSE4

From Wikipedia, the free encyclopedia

SSE4 (Streaming SIMD Extensions 4) is a SIMD CPU instruction set used in the Intel Core microarchitecture and AMD K10 (K8L). It was announced on September 27, 2006, at the Fall 2006 Intel Developer Forum, with vague details in a white paper;[1] more precise details of 47 instructions became available at the Spring 2007 Intel Developer Forum in Beijing, in a presentation.[2] SSE4 extended the SSE3 instruction set, which was released in early 2004. All software using previous Intel SIMD instructions (e.g. SSE3) is compatible with modern microprocessors supporting SSE4 instructions. All existing software continues to run correctly without modification on microprocessors that incorporate SSE4, as well as in the presence of existing and new applications that incorporate SSE4.[3]

Like earlier CPU SIMD instruction sets, SSE4 supports up to 16 registers, each 128 bits wide, which can hold four 32-bit integers, four 32-bit single-precision floating-point numbers, or two 64-bit double-precision floating-point numbers.[1] SIMD operations, such as vector element-wise addition/multiplication and vector scalar addition/multiplication, process multiple bytes of data in a single CPU instruction. This parallelism can yield noticeable increases in performance. SSE4.2 introduced new SIMD string operations, including an instruction to compare two string fragments of up to 16 bytes each.[1] SSE4.2 is a subset of SSE4 and was released a few years after the initial release of SSE4.
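As a rough illustration of the lane layout described above, the following Python sketch models a 128-bit register as one integer holding four 32-bit lanes and performs an element-wise addition with per-lane wraparound. It is a scalar model for explanation only, not a way of programming the hardware; the helper names `pack4`, `unpack4` and `add_lanes32` are hypothetical.

```python
MASK32 = 0xFFFFFFFF  # one 32-bit lane


def pack4(lanes):
    """Pack four 32-bit values (lane 0 in the lowest bits) into a 128-bit integer."""
    value = 0
    for i, lane in enumerate(lanes):
        value |= (lane & MASK32) << (32 * i)
    return value


def unpack4(value):
    """Split a 128-bit integer back into four 32-bit lanes, lane 0 first."""
    return [(value >> (32 * i)) & MASK32 for i in range(4)]


def add_lanes32(a, b):
    """Element-wise 32-bit addition; each lane wraps around independently."""
    return pack4([(x + y) & MASK32 for x, y in zip(unpack4(a), unpack4(b))])
```

A single call to `add_lanes32` stands in for what one packed-add instruction does to all four lanes at once; a carry out of one lane never spills into its neighbour.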

SSE4 subsets

Intel SSE4 consists of 54 instructions. A subset consisting of 47 instructions, referred to as SSE4.1 in some Intel documentation, is available in Penryn. Additionally, SSE4.2, a second subset consisting of the seven remaining instructions, is first available in Nehalem-based Core i7. Intel credits feedback from developers as playing an important role in the development of the instruction set.

Starting with Barcelona-based processors, AMD introduced the SSE4a instruction set, which has four SSE4 instructions and four new SSE instructions. These instructions are not found in Intel's processors supporting SSE4.1, and AMD processors only started supporting Intel's SSE4.1 and SSE4.2 (the full SSE4 instruction set) in the Bulldozer-based FX processors. With SSE4a, the misaligned SSE feature was also introduced, which meant unaligned load instructions were as fast as aligned versions on aligned addresses. It also allowed disabling the alignment check on non-load SSE operations accessing memory.[4] Intel later introduced similar speed improvements to unaligned SSE in their Nehalem processors, but did not introduce misaligned access by non-load SSE instructions until AVX.[5]

Name confusion

What is now known as SSSE3 (Supplemental Streaming SIMD Extensions 3), introduced in the Intel Core 2 processor line, was referred to as SSE4 by some media until Intel came up with the SSSE3 moniker. The instructions were internally dubbed Merom New Instructions, and Intel originally did not plan to assign them a special name, which was criticized by some journalists.[6] Intel eventually cleared up the confusion and reserved the SSE4 name for its next instruction set extension.[7]

Intel uses the marketing term HD Boost to refer to SSE4.[8]

New instructions

Unlike all previous iterations of SSE, SSE4 contains instructions that execute operations which are not specific to multimedia applications. It features a number of instructions whose action is determined by a constant field and a set of instructions that take XMM0 as an implicit third operand.

Several of these instructions are enabled by the single-cycle shuffle engine in Penryn. (Shuffle operations reorder bytes within a register.)
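The two operand styles mentioned above can be illustrated with the blend instructions listed under SSE4.1: BLENDPS takes its selection bits from an immediate constant, while BLENDVPS takes them from the implicit XMM0 operand. The Python sketch below is a scalar model, assuming each register is a list of four 32-bit lanes (lane 0 first) and that BLENDVPS consults the most significant bit of each XMM0 lane; the function names are hypothetical.

```python
def blendps(dest, src, imm8):
    """Immediate-controlled blend: bit i of imm8 selects src lane i, else dest lane i."""
    return [s if (imm8 >> i) & 1 else d
            for i, (d, s) in enumerate(zip(dest, src))]


def blendvps(dest, src, xmm0):
    """Variable blend: bit 31 of lane i of the implicit XMM0 mask selects src lane i."""
    return [s if (m >> 31) & 1 else d
            for d, s, m in zip(dest, src, xmm0)]
```

In the first form the selection is fixed when the instruction is assembled; in the second it can be computed at run time, for example by an earlier comparison that leaves its result in XMM0.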

SSE4.1

These instructions were introduced with the Penryn microarchitecture, the 45 nm shrink of Intel's Core microarchitecture. Support is indicated via the CPUID.01H:ECX.SSE41[Bit 19] flag.

| Instruction | Description |
| --- | --- |
| MPSADBW | Compute eight offset sums of absolute differences, four at a time (i.e., abs(x0−y0)+abs(x1−y1)+abs(x2−y2)+abs(x3−y3), abs(x0−y1)+abs(x1−y2)+abs(x2−y3)+abs(x3−y4), ..., abs(x0−y7)+abs(x1−y8)+abs(x2−y9)+abs(x3−y10)); this operation is important for some HD codecs, and allows an 8×8 block difference to be computed in fewer than seven cycles.[9] One bit of a three-bit immediate operand indicates whether y0..y10 or y4..y14 should be used from the destination operand, the other two whether x0..x3, x4..x7, x8..x11 or x12..x15 should be used from the source. |
| PHMINPOSUW | Sets the bottom unsigned 16-bit word of the destination to the smallest unsigned 16-bit word in the source, and the next-from-bottom to the index of that word in the source. |
| PMULDQ | Packed 32-bit signed "long" multiplication, two (1st and 3rd) out of four packed integers multiplied giving two packed 64-bit results. |
| PMULLD | Packed 32-bit signed "low" multiplication, four packed sets of integers multiplied giving four packed 32-bit results. |
| DPPS, DPPD | Dot product for AOS (Array of Structs) data. This takes an immediate operand consisting of four (or two for DPPD) bits to select which of the entries in the input to multiply and accumulate, and another four (or two for DPPD) to select whether to put 0 or the dot-product in the appropriate field of the output. |
| BLENDPS, BLENDPD, BLENDVPS, BLENDVPD, PBLENDVB, PBLENDW | Conditional copying of elements in one location with another, based (for non-V form) on the bits in an immediate operand, and (for V form) on the bits in register XMM0. |
| PMINSB, PMAXSB, PMINUW, PMAXUW, PMINUD, PMAXUD, PMINSD, PMAXSD | Packed minimum/maximum for different integer operand types. |
| ROUNDPS, ROUNDSS, ROUNDPD, ROUNDSD | Round values in a floating-point register to integers, using one of four rounding modes specified by an immediate operand. |
| INSERTPS, PINSRB, PINSRD/PINSRQ, EXTRACTPS, PEXTRB, PEXTRD/PEXTRQ | The INSERTPS and PINSR instructions read 8, 16 or 32 bits from an x86 register or memory location and insert them into a field in the destination register given by an immediate operand. EXTRACTPS and PEXTR read a field from the source register and insert it into an x86 register or memory location. For example, PEXTRD eax, xmm0, 1; EXTRACTPS [addr+4*eax], xmm1, 1 stores field 1 of xmm1 at the address indexed by field 1 of xmm0. |
| PMOVSXBW, PMOVZXBW, PMOVSXBD, PMOVZXBD, PMOVSXBQ, PMOVZXBQ, PMOVSXWD, PMOVZXWD, PMOVSXWQ, PMOVZXWQ, PMOVSXDQ, PMOVZXDQ | Packed sign/zero extension to wider types. |
| PTEST | This is similar to the TEST instruction, in that it sets the Z flag to the result of an AND between its operands: ZF is set if DEST AND SRC is equal to 0. Additionally, it sets the C flag if (NOT DEST) AND SRC equals zero. This is equivalent to setting the Z flag if none of the bits masked by SRC are set, and the C flag if all of the bits masked by SRC are set. |
| PCMPEQQ | Quadword (64 bits) compare for equality. |
| PACKUSDW | Convert signed DWORDs into unsigned WORDs with saturation. |
| MOVNTDQA | Efficient read from write-combining memory area into SSE register; this is useful for retrieving results from peripherals attached to the memory bus. |
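The behaviour of a few of the instructions in the table can be sketched as scalar reference models. The Python below is illustrative only: registers are represented as lists of elements (element 0 first) or as 128-bit integers, and all function names are hypothetical. Some details are assumptions drawn from Intel's documented behaviour rather than from the table: the upper immediate nibble of DPPS is taken as the multiply mask and the lower nibble as the output mask, and PHMINPOSUW is taken to zero the unused result words and to report the lowest index on ties.

```python
def mpsadbw(dest, src, imm8):
    """dest, src: 16 unsigned bytes each. Returns the eight 16-bit sums."""
    y_off = 4 * ((imm8 >> 2) & 1)  # y0..y10 or y4..y14 from the destination
    x_off = 4 * (imm8 & 3)         # x0..x3, x4..x7, x8..x11 or x12..x15 from the source
    x = src[x_off:x_off + 4]
    y = dest[y_off:y_off + 11]
    return [sum(abs(x[k] - y[j + k]) for k in range(4)) for j in range(8)]


def phminposuw(src):
    """src: eight unsigned 16-bit words. Returns the eight result words."""
    smallest = min(src)
    index = src.index(smallest)    # lowest index wins on ties
    return [smallest, index, 0, 0, 0, 0, 0, 0]


def dpps(a, b, imm8):
    """a, b: four floats each. High nibble picks products, low nibble picks outputs."""
    total = sum((a[i] * b[i] for i in range(4) if (imm8 >> (4 + i)) & 1), 0.0)
    return [total if (imm8 >> i) & 1 else 0.0 for i in range(4)]


def ptest(dest, src):
    """dest, src: 128-bit integers. Returns the pair (ZF, CF)."""
    mask = (1 << 128) - 1
    zf = (dest & src) == 0            # none of the bits masked by src are set
    cf = ((~dest & mask) & src) == 0  # all of the bits masked by src are set
    return zf, cf
```

The floating-point model sums the selected products in index order, which may round differently from the hardware for inexact inputs.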

SSE4.2

SSE4.2 added STTNI (String and Text New Instructions),[10] several new instructions that perform character searches and comparison on two operands of 16 bytes at a time. These were designed (among other things) to speed up the parsing of XML documents.[11] It also added a CRC32 instruction to compute cyclic redundancy checks as used in certain data transfer protocols. These instructions were first implemented in the Nehalem-based Intel Core i7 product line, and complete the SSE4 instruction set. AMD, on the other hand, first added support starting with the Bulldozer microarchitecture. Support is indicated via the CPUID.01H:ECX.SSE42[Bit 20] flag.

Windows 11 24H2 requires the CPU to support SSE4.2; otherwise the Windows kernel will not boot.[12]

| Instruction | Description |
| --- | --- |
| CRC32 | Accumulate CRC32C value using the polynomial 0x11EDC6F41 (or, without the high order bit, 0x1EDC6F41).[13][14] |
| PCMPESTRI | Packed Compare Explicit Length Strings, Return Index |
| PCMPESTRM | Packed Compare Explicit Length Strings, Return Mask |
| PCMPISTRI | Packed Compare Implicit Length Strings, Return Index |
| PCMPISTRM | Packed Compare Implicit Length Strings, Return Mask |
| PCMPGTQ | Compare Packed Signed 64-bit data For Greater Than |
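To make the CRC32 row more concrete, here is a minimal bit-at-a-time Python sketch of accumulating a CRC32C value with the polynomial 0x1EDC6F41. It assumes the usual bit-reflected convention, in which the polynomial appears as 0x82F63B78, and it is a software model rather than a description of the hardware; the function names are hypothetical. The accumulate step applies no initial or final inversion, so the conventional checksum wraps it with an all-ones seed and a final XOR.

```python
POLY_REFLECTED = 0x82F63B78  # 0x1EDC6F41 with its 32 bits reversed


def crc32c_accumulate(crc, data):
    """Fold the bytes of `data` into a running 32-bit CRC, one bit at a time."""
    for byte in data:
        crc ^= byte
        for _ in range(8):
            crc = (crc >> 1) ^ (POLY_REFLECTED if crc & 1 else 0)
    return crc & 0xFFFFFFFF


def crc32c(data):
    """Conventional CRC-32C: seed with all ones, accumulate, then invert."""
    return crc32c_accumulate(0xFFFFFFFF, data) ^ 0xFFFFFFFF
```

Because the running value is passed in and returned, a long buffer can be processed in pieces, which mirrors how an accumulating instruction would be applied to successive chunks of input.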

POPCNT and LZCNT

These instructions operate on integer registers rather than SSE registers because they are not SIMD instructions; they merely appeared at the same time. Although AMD introduced them together with the SSE4a instruction set, they are counted as separate extensions with their own dedicated CPUID bits to indicate support. Intel implements POPCNT beginning with the Nehalem microarchitecture and LZCNT beginning with the Haswell microarchitecture. AMD implements both, beginning with the Barcelona microarchitecture.

AMD calls this pair of instructions Advanced Bit Manipulation (ABM).

| Instruction | Description |
| --- | --- |
| POPCNT | Population count (count number of bits set to 1). Support is indicated via the CPUID.01H:ECX.POPCNT[Bit 23] flag.[15] |
| LZCNT | Leading zero count. Support is indicated via the CPUID.80000001H:ECX.ABM[Bit 5] flag.[16] |
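The CPUID flags quoted in this article can be decoded from raw register values as in the following Python sketch. It assumes the two ECX values (from CPUID leaf 01H and leaf 80000001H) have already been obtained by some other means, since Python cannot execute CPUID directly; the function name and the dictionary keys are hypothetical.

```python
def decode_sse4_flags(leaf1_ecx, ext_leaf1_ecx):
    """Decode SSE4-related feature bits from two raw ECX values.

    leaf1_ecx:     ECX returned by CPUID leaf 01H
    ext_leaf1_ecx: ECX returned by CPUID leaf 80000001H
    """
    def bit(value, n):
        return bool((value >> n) & 1)

    return {
        "SSE4.1": bit(leaf1_ecx, 19),
        "SSE4.2": bit(leaf1_ecx, 20),
        "POPCNT": bit(leaf1_ecx, 23),
        "LZCNT (ABM)": bit(ext_leaf1_ecx, 5),
        "SSE4a": bit(ext_leaf1_ecx, 6),
    }
```

Each feature has its own bit, so software should test the specific flag for the instruction it intends to use rather than infer one feature from another.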

LZCNT shares its opcode with the BSR (bit scan reverse) instruction and differs only by a prefix (LZCNT is encoded as a REP-prefixed BSR). As a result, LZCNT executed on some CPUs that do not support it, such as Intel CPUs prior to Haswell, may silently perform the BSR operation instead of raising an invalid instruction exception. This is a problem because LZCNT and BSR return different values.
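The difference between the two results can be seen in a small Python model for 32-bit operands. This is a sketch under the assumption that inputs lie in the range 0 to 2^32 − 1; the function names are hypothetical, and `None` stands in for the undefined destination that BSR leaves when its source is zero.

```python
def lzcnt32(x):
    """Number of leading zero bits in a 32-bit value; 32 when x is zero."""
    return 32 - x.bit_length()


def bsr32(x):
    """Index of the highest set bit; undefined (modelled as None) when x is zero."""
    return x.bit_length() - 1 if x else None
```

For any nonzero input the two are related by `lzcnt32(x) == 31 - bsr32(x)`, so code that silently receives the BSR result in place of the LZCNT result will compute with the wrong value.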

Trailing zeros can be counted using the BSF (bit scan forward) or TZCNT instructions.

Windows 11 24H2 requires the CPU to support POPCNT; otherwise the Windows kernel will not boot.[17]

SSE4a

The SSE4a instruction group was introduced in AMD's Barcelona microarchitecture. These instructions are not available in Intel processors. Support is indicated via the CPUID.80000001H:ECX.SSE4A[Bit 6] flag.[16]

| Instruction | Description |
| --- | --- |
| EXTRQ/INSERTQ | Combined mask-shift instructions.[18] |
| MOVNTSD/MOVNTSS | Scalar streaming store instructions.[19] |

Supporting CPUs

  • Intel
  • AMD
    • K10-based processors (SSE4a, POPCNT and LZCNT supported)
    • "Cat" low-power processors
      • Bobcat-based processors (SSE4a, POPCNT and LZCNT supported)
      • Jaguar-based processors and newer (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
      • Puma-based processors and newer (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
    • "Heavy Equipment" processors (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
    • Zen-based processors (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
    • Zen+-based processors (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
    • Zen 2-based processors (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
    • Zen 3-based processors (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
    • Zen 4-based processors (SSE4a, SSE4.1, SSE4.2, POPCNT and LZCNT supported)
  • VIA
    • Nano 3000, X2, QuadCore processors (SSE4.1 supported)
    • Nano QuadCore C4000-series processors (SSE4.1, SSE4.2 supported)
    • Eden X4 processors (SSE4.1, SSE4.2 supported)
  • Zhaoxin
    • ZX-C processors and newer (SSE4.1, SSE4.2 supported)

References

  1. Intel Streaming SIMD Extensions 4 (SSE4) Instruction Set Innovation. Archived May 30, 2009, at the Wayback Machine. Intel.
  2. Tuning for Intel SSE4 for the 45nm Next Generation Intel Core Microarchitecture. Archived March 8, 2021, at the Wayback Machine. Intel.
  3. "Intel SSE4 Programming Reference" (PDF). Archived (PDF) from the original on February 15, 2020. Retrieved December 26, 2014.
  4. "'Barcelona' Processor Feature: SSE Misaligned Access". AMD. Archived from the original on August 9, 2016. Retrieved March 3, 2015.
  5. "Inside Intel Nehalem Microarchitecture". Archived from the original on April 2, 2015. Retrieved March 3, 2015.
  6. My Experience With "Conroe". Archived October 15, 2013, at the Wayback Machine. DailyTech.
  7. Extending the World’s Most Popular Processor Architecture. Archived November 24, 2011, at the Wayback Machine. Intel.
  8. "Intel - Data Center Solutions, IOT, and PC Innovation". Intel. Archived from the original on February 7, 2013. Retrieved September 17, 2009.
  9. Motion Estimation with Intel Streaming SIMD Extensions 4 (Intel SSE4). Archived June 16, 2018, at the Wayback Machine. Intel.
  10. "Schema Validation with Intel® Streaming SIMD Extensions 4 (Intel® SSE4)". Archived from the original on June 17, 2018. Retrieved February 6, 2012.
  11. "XML Parsing Accelerator with Intel® Streaming SIMD Extensions 4 (Intel® SSE4)". Archived from the original on June 17, 2018. Retrieved February 6, 2012.
  12. Klotz, Aaron (April 24, 2024). "Microsoft blocks some PCs from Windows 11 24H2 — CPU must support SSE4.2 or the OS will not boot". Tom's Hardware. Retrieved April 29, 2024.
  13. Intel SSE4 Programming Reference. Archived February 15, 2020, at the Wayback Machine. p. 61. See also RFC 3385 (archived June 19, 2008, at the Wayback Machine) for discussion of the CRC32C polynomial.
  14. Fast, Parallelized CRC Computation Using the Nehalem CRC32 Instruction. Dr. Dobbs, April 12, 2011.
  15. Intel® 64 and IA-32 Architectures Software Developer's Manual, Volume 2B: Instruction Set Reference, N–Z. Archived March 8, 2011, at the Wayback Machine.
  16. "AMD CPUID Specification" (PDF). Archived (PDF) from the original on November 1, 2013. Retrieved October 30, 2013.
  17. Sen, Sayan (March 17, 2024). "Microsoft fixes a misfired PopCnt block but Windows 11 24H2 requirements may be here to stay". Neowin. Retrieved March 17, 2024.
  18. Rahul Chaturvedi (September 17, 2007). "'Barcelona' Processor Feature: SSE4a Instruction Set". Archived from the original on October 25, 2013.
  19. Rahul Chaturvedi (October 2, 2007). "'Barcelona' Processor Feature: SSE4a, part 2". Archived from the original on October 25, 2013.
  20. "AMD FX-Series FX-6300 - FD6300WMW6KHK / FD6300WMHKBOX". Archived from the original on August 17, 2017. Retrieved October 9, 2015.

External links