SSE3

SSE3, Streaming SIMD Extensions 3, also known by its Intel code name Prescott New Instructions (PNI),^[1] is the third iteration of the SSE instruction set for the IA-32 (x86) architecture. Intel introduced SSE3 in early 2004 with the Prescott revision of their Pentium 4 CPU.^[1] In April 2005, AMD introduced a subset of SSE3 in revision E (Venice and San Diego) of their Athlon 64 CPUs.^[2] The earlier SIMD instruction sets on the x86 platform, from oldest to newest, are MMX, 3DNow! (developed by AMD, no longer supported on newer CPUs), SSE, and SSE2.

SSE3 contains 13 new instructions over SSE2.^[3]

Changes

The most notable change is the capability to work horizontally in a register, as opposed to the more or less strictly vertical operation of all previous SSE instructions. More specifically, instructions to add and subtract the multiple values stored within a single register have been added.^[4] These instructions can be used to speed up the implementation of a number of DSP and 3D operations. There is also a new instruction to convert floating-point values to integers without having to change the global rounding mode, thus avoiding costly pipeline stalls. Finally, the extension adds LDDQU, an alternative misaligned integer vector load that has better performance on NetBurst-based platforms for loads that cross cache-line boundaries.^[5]

CPUs with SSE3

AMD:
- Opteron (since Stepping E4^[6])
- Sempron (since Palermo, Stepping E3)
- Athlon 64 (since Venice Stepping E3 and San Diego Stepping E4)
- Athlon 64 FX (since San Diego Stepping E4)
- Athlon 64 X2
- Phenom 64 X2
- Turion family
- K10 family
- APU family (including models without a GPU)
- FX Series
- Zen family
Intel:
- Celeron D
- Celeron (starting with Core microarchitecture)
- Pentium 4 (since Prescott)
- Pentium D
- Pentium Extreme Edition (but NOT Pentium 4 Extreme Edition)
- Pentium Dual-Core
- Pentium (starting with Core microarchitecture)
- Core
- Xeon (since Nocona^[7])
- Atom
VIA/Centaur:
- C7
- Nano
Transmeta:
- Efficeon TM88xx (NOT Model Numbers TM86xx)

New instructions

Common instructions

Arithmetic

ADDSUBPD

Add-Subtract-Packed-Double^[8]

Input: { A0, A1 }, { B0, B1 }
Output: { A0 − B0, A1 + B1 }

ADDSUBPS

Add-Subtract-Packed-Single^[8]

Input: { A0, A1, A2, A3 }, { B0, B1, B2, B3 }
Output: { A0 − B0, A1 + B1, A2 − B2, A3 + B3 }
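
The lane pattern above can be modelled in a few lines of Python. This is only an illustrative sketch of the semantics, not the instructions themselves: registers are represented as plain lists, and the names `addsub` and `complex_mul_packed` are hypothetical helpers introduced here. The second function shows one common use of the alternating subtract/add pattern, multiplying packed complex numbers stored as interleaved real and imaginary parts; the duplication shuffles it needs follow the MOVSLDUP and MOVSHDUP patterns.

```python
def addsub(a, b):
    """Model of ADDSUBPS/ADDSUBPD: subtract in even lanes, add in odd lanes."""
    if len(a) != len(b):
        raise ValueError("operands must have the same number of lanes")
    return [x - y if i % 2 == 0 else x + y
            for i, (x, y) in enumerate(zip(a, b))]


def complex_mul_packed(a, b):
    """Multiply two pairs of complex numbers stored as [re0, im0, re1, im1].

    Lane shuffles are written as list indexing; this is a model of the data
    flow only, not a claim about how any particular library implements it.
    """
    b_re = [b[0], b[0], b[2], b[2]]        # real parts duplicated (MOVSLDUP pattern)
    b_im = [b[1], b[1], b[3], b[3]]        # imaginary parts duplicated (MOVSHDUP pattern)
    a_swapped = [a[1], a[0], a[3], a[2]]   # swap re/im inside each complex number
    t1 = [x * y for x, y in zip(a, b_re)]          # [ar*br, ai*br, ...]
    t2 = [x * y for x, y in zip(a_swapped, b_im)]  # [ai*bi, ar*bi, ...]
    return addsub(t1, t2)                  # [ar*br - ai*bi, ai*br + ar*bi, ...]
```

For example, `complex_mul_packed([1, 2, 3, 4], [5, 6, 7, 8])` computes (1 + 2i)(5 + 6i) and (3 + 4i)(7 + 8i) in one pass over the four lanes.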

AOS (Array of Structures)

HADDPD

Horizontal-Add-Packed-Double^[8]

Input: { A0, A1 }, { B0, B1 }
Output: { A0 + A1, B0 + B1 }

HADDPS

Horizontal-Add-Packed-Single^[8]

Input: { A0, A1, A2, A3 }, { B0, B1, B2, B3 }
Output: { A0 + A1, A2 + A3, B0 + B1, B2 + B3 }

HSUBPD

Horizontal-Subtract-Packed-Double^[8]

Input: { A0, A1 }, { B0, B1 }
Output: { A0 − A1, B0 − B1 }

HSUBPS

Horizontal-Subtract-Packed-Single^[8]

Input: { A0, A1, A2, A3 }, { B0, B1, B2, B3 }
Output: { A0 − A1, A2 − A3, B0 − B1, B2 − B3 }
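
The horizontal operations listed above can be sketched with a small Python model, again treating registers as lists. The helper names (`hadd`, `hsub`, `horizontal_sum`) are hypothetical and exist only for this illustration. The last function shows a typical use: applying the horizontal add twice to a four-lane value leaves the sum of all four lanes in every lane.

```python
def _pairs(v, op):
    """Apply op to each adjacent pair (v[0], v[1]), (v[2], v[3]), ..."""
    if len(v) % 2 != 0:
        raise ValueError("lane count must be even")
    return [op(v[i], v[i + 1]) for i in range(0, len(v), 2)]


def hadd(a, b):
    """Model of HADDPS/HADDPD: pair sums of a, followed by pair sums of b."""
    return _pairs(a, lambda x, y: x + y) + _pairs(b, lambda x, y: x + y)


def hsub(a, b):
    """Model of HSUBPS/HSUBPD: pair differences of a, then of b."""
    return _pairs(a, lambda x, y: x - y) + _pairs(b, lambda x, y: x - y)


def horizontal_sum(v):
    """Sum four lanes using two horizontal adds."""
    if len(v) != 4:
        raise ValueError("this sketch assumes four lanes")
    step = hadd(v, v)        # [v0+v1, v2+v3, v0+v1, v2+v3]
    step = hadd(step, step)  # every lane now holds v0+v1+v2+v3
    return step[0]
```

Calling `hadd([1, 2, 3, 4], [10, 20, 30, 40])` reproduces the HADDPS output pattern shown above, with the two pair sums of the first operand followed by the two pair sums of the second.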

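LDDQU: As stated above, this is an alternative misaligned integer vector load.^[8] It can be helpful for video compression tasks.
MOVDDUP, MOVSHDUP, MOVSLDUP:^[4] These duplicate elements within a register: MOVDDUP copies the low double-precision value into both lanes, MOVSHDUP copies each odd-indexed single-precision value into the lane below it, and MOVSLDUP copies each even-indexed single-precision value into the lane above it. They are useful for complex-number arithmetic and for waveform calculations such as audio processing.
FISTTP: Like the older x87 FISTP instruction, but ignores the floating-point control register's rounding mode settings and uses the "chop" (truncate) mode instead.^[4] This allows omission of the expensive loading and re-loading of the control register in languages such as C, where the standard requires float-to-int conversion to truncate.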
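
The difference between the two conversions can be illustrated with a small Python model. This is a sketch under stated assumptions: the function names are hypothetical, the FISTP model assumes the control register is left in the x87 default mode (round to nearest, ties to even), and overflow and NaN handling are ignored.

```python
import math


def fistp_default_mode(x):
    """Model of FISTP with the default x87 rounding mode (nearest, ties to even).

    Python's built-in round() also rounds ties to even, so it serves as the model.
    """
    return round(x)


def fisttp(x):
    """Model of FISTTP: always truncate toward zero, whatever the rounding mode."""
    return math.trunc(x)
```

With these models, `fistp_default_mode(2.7)` gives 3 while `fisttp(2.7)` gives 2, which is the result a C cast such as `(int)2.7` must produce. Without FISTTP, obtaining the truncated result from FISTP requires switching the control register to chop mode and back again.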

Other instructions

MONITOR, MWAIT: The MONITOR instruction is used to specify a memory address for monitoring, while the MWAIT instruction puts the processor into a low-power state and waits for a write event to the monitored address.^[4]

References

[1] Shimpi, Anand Lal; Wilson, Derek. "Intel's Pentium 4 E: Prescott Arrives with Luggage". anandtech. Retrieved 2023-04-10.
[2] Shimpi, Anand Lal. "Industry Update - Q4-2004: AMD adds SSE3 Support, Intel's 925/915 not selling and more". anandtech. Retrieved 2023-04-10.
[3] "Intel Instruction Set Extensions Technology". Intel. Retrieved 2023-04-10.
[4] Wright, Christopher. "SSE3 Instruction Set". softpixel. Retrieved 2023-04-10.
[5] "LDDQU — Load Unaligned Integer 128 Bits". felixcloutier. Retrieved 2023-04-10.
[6] Wilson, Derek. "AMD K8 E4 Stepping: SSE3 Performance". anandtech. Retrieved 2023-04-10.
[7] "Intel Xeon 3.4GHz ['Nocona' core]". HEXUS. 2004-08-18. Retrieved 2023-04-10.
[8] "SSE3 Instructions - x86 Assembly Language Reference Manual". docs.oracle. Retrieved 2023-04-10.

External links

X-bit Labs
