Machine code

From Wikipedia, the free encyclopedia

Machine language monitor running on a W65C816S microprocessor, displaying code disassembly, as well as processor register and memory dumps

In computer programming, machine code is computer code consisting of machine language instructions, which are used to control a computer's central processing unit (CPU). Although decimal computers were once common, the contemporary marketplace is dominated by binary computers; for those computers, machine code is "the binary representation of a computer program which is actually read and interpreted by the computer. A program in machine code consists of a sequence of machine instructions (possibly interspersed with data)."[1]

Each instruction causes the CPU to perform a very specific task, such as a load, a store, a jump, or an arithmetic logic unit (ALU) operation on one or more units of data in the CPU's registers or memory.

Early CPUs had specific machine code that might break backward compatibility with each new CPU released. The notion of an instruction set architecture (ISA) defines and specifies the behavior and encoding in memory of the instruction set of the system, without specifying its exact implementation. This acts as an abstraction layer, enabling compatibility within the same family of CPUs, so that machine code written or generated according to the ISA for the family will run on all CPUs in the family, including future CPUs.

In general, each architecture family (e.g. x86, ARM) has its own ISA, and hence its own specific machine code language. There are exceptions, such as the VAX architecture, which included optional support of the PDP-11 instruction set, and IA-64, which included optional support of the IA-32 instruction set. Another example is the PowerPC 615, a processor designed to natively process both PowerPC and x86 instructions.

Machine code is a strictly numerical language, and is the lowest-level interface to the CPU intended for a programmer. Assembly language provides a direct mapping between the numerical machine code and a human-readable version where numerical opcodes and operands are replaced by readable strings (e.g. 0x90 as the NOP instruction on x86, with 0xB8 being the MOV instruction, 0xE8 meaning CALL or 0x0F05 standing for the SYSCALL instruction). While it is possible to write programs directly in machine code, managing individual bits and calculating numerical addresses and constants manually is tedious and error-prone. For this reason, programs are very rarely written directly in machine code in modern contexts, although this may still be done for low-level debugging, program patching (especially when assembler source is not available) and assembly language disassembly.
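
The mapping from opcode numbers to mnemonics can be sketched as a small lookup-driven decoder. The following is only an illustration covering the four x86 opcodes named above; the function name `decode`, the choice of 32/64-bit operand sizes, and the sample byte string are assumptions made for this example, and a real disassembler handles far more cases.

```python
# Toy decoder for the four x86 opcodes named in the text (32/64-bit mode assumed).
# This is an illustration, not a real disassembler.

def decode(code: bytes) -> list[str]:
    """Translate a byte string into mnemonics, one instruction at a time."""
    out = []
    i = 0
    while i < len(code):
        b = code[i]
        if b == 0x90:                        # NOP: one byte, no operands
            out.append("nop")
            i += 1
        elif b == 0xB8:                      # MOV EAX, imm32: opcode + 4-byte little-endian constant
            imm = int.from_bytes(code[i + 1:i + 5], "little")
            out.append(f"mov eax, {imm:#x}")
            i += 5
        elif b == 0xE8:                      # CALL rel32: opcode + 4-byte signed displacement
            rel = int.from_bytes(code[i + 1:i + 5], "little", signed=True)
            out.append(f"call {rel:+d}")
            i += 5
        elif code[i:i + 2] == b"\x0f\x05":   # SYSCALL: two-byte opcode
            out.append("syscall")
            i += 2
        else:
            raise ValueError(f"opcode {b:#04x} not in this toy table")
    return out


# Hypothetical sample bytes chosen for the example.
program = bytes.fromhex("90 b8 3c 00 00 00 0f 05")
listing = decode(program)
print(listing)
```

Writing the same eight bytes by hand, including the little-endian constant, is the kind of bookkeeping that makes direct machine-code programming tedious.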

The majority of practical programs today are written in higher-level languages. Those programs are either translated into machine code by a compiler, or are interpreted by an interpreter, usually after being translated into an intermediate code, such as a bytecode, that is then interpreted.[nb 1]

Machine code is by definition the lowest level of programming detail visible to the programmer, but internally many processors use microcode or optimize and transform machine code instructions into sequences of micro-ops. Microcode and micro-ops are not generally considered to be machine code; except on some machines, the user cannot write microcode or micro-ops, and the operation of microcode and the transformation of machine-code instructions into micro-ops happens transparently to the programmer except for performance-related side effects.

Instruction set

Every processor or processor family has its own instruction set. Instructions are patterns of bits, digits, or characters that correspond to machine commands. Thus, the instruction set is specific to a class of processors using (mostly) the same architecture. Successor or derivative processor designs often include instructions of a predecessor and may add new instructions. Occasionally, a successor design will discontinue or alter the meaning of some instruction code (typically because it is needed for new purposes), affecting code compatibility to some extent; even compatible processors may show slightly different behavior for some instructions, but this is rarely a problem. Systems may also differ in other details, such as memory arrangement, operating systems, or peripheral devices. Because a program normally relies on such factors, different systems will typically not run the same machine code, even when the same type of processor is used.

A processor's instruction set may have fixed-length or variable-length instructions. How the patterns are organized varies with the particular architecture and type of instruction. Most instructions have one or more opcode fields that specify the basic instruction type (such as arithmetic, logical, jump, etc.), the operation (such as add or compare), and other fields that may give the type of the operand(s), the addressing mode(s), the addressing offset(s) or index, or the operand value itself (such constant operands contained in an instruction are called immediate).[2]

Not all machines or individual instructions have explicit operands. On a machine with a single accumulator, the accumulator is implicitly both the left operand and result of most arithmetic instructions. Some other architectures, such as the x86 architecture, have accumulator versions of common instructions, with the accumulator regarded as one of the general registers by longer instructions. A stack machine has most or all of its operands on an implicit stack. Special-purpose instructions also often lack explicit operands; for example, CPUID in the x86 architecture writes values into four implicit destination registers. This distinction between explicit and implicit operands is important in code generators, especially in the register allocation and live range tracking parts. A good code optimizer can track implicit as well as explicit operands, which may allow more frequent constant propagation, constant folding of registers (a register assigned the result of a constant expression freed up by replacing it by that constant) and other code enhancements.

Programs

A computer program is a list of instructions that can be executed by a central processing unit (CPU). A program is executed so that the CPU running it solves a problem and thus produces a result. While simple processors execute instructions one after another, superscalar processors are able under certain circumstances (when the pipeline is full) to execute two or more instructions simultaneously. As an example, the original Intel Pentium from 1993 can execute at most two instructions per clock cycle when its pipeline is full.

Program flow may be influenced by special 'jump' instructions that transfer execution to an address (and hence instruction) other than the next numerically sequential address. Whether these conditional jumps occur is dependent upon a condition such as a value being greater than, less than, or equal to another value.
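
How a conditional jump redirects sequential execution can be shown with a minimal fetch-and-execute loop. The instruction names (`add`, `dec`, `jgt`), the tuple format and the register names below are all hypothetical and do not correspond to any real instruction set; the sketch only models the idea of a program counter that normally advances by one and is overwritten when a jump is taken.

```python
# Hypothetical three-field instructions: (operation, operand_a, operand_b).
# The program sums n + (n-1) + ... + 1 using a single conditional jump.

def run(program, registers):
    pc = 0                                   # program counter: index of the next instruction
    steps = 0
    while pc < len(program):
        op, a, b = program[pc]
        pc += 1                              # default: continue with the next sequential address
        if op == "dec":
            registers[a] -= 1
        elif op == "add":
            registers[a] += registers[b]
        elif op == "jgt":                    # jump to address b only if register a > 0
            if registers[a] > 0:
                pc = b
        else:
            raise ValueError(f"unknown operation {op!r}")
        steps += 1
    return steps


program = [
    ("add", "total", "n"),   # address 0
    ("dec", "n", None),      # address 1
    ("jgt", "n", 0),         # address 2: loop back to address 0 while n > 0
]
regs = {"n": 5, "total": 0}
steps = run(program, regs)
print(regs, steps)
```

The loop exits when the jump is not taken and the program counter runs past the last instruction.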

Assembly languages

A much more human-friendly rendition of machine language, called assembly language, uses mnemonic codes to refer to machine code instructions, rather than using the instructions' numeric values directly, and uses symbolic names to refer to storage locations and sometimes registers.[3] For example, on the Zilog Z80 processor, the machine code 00000101, which causes the CPU to decrement the B general-purpose register, would be represented in assembly language as DEC B.[4]
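
The Z80 example can be generalized in a short sketch. Only the pair DEC B = 00000101 comes from the text above; the bit pattern `00 rrr 101` for `DEC r` and the three-bit register codes are taken from general Z80 documentation and should be read as assumptions of this example, as should the function names.

```python
# Register codes and the DEC r bit pattern 00 rrr 101 come from general
# Z80 documentation; only DEC B = 00000101 appears in the text itself.
REGISTER_CODES = {"B": 0b000, "C": 0b001, "D": 0b010, "E": 0b011,
                  "H": 0b100, "L": 0b101, "A": 0b111}


def assemble_dec(register: str) -> int:
    """Return the one-byte machine code for `DEC <register>`."""
    return (REGISTER_CODES[register] << 3) | 0b101


def disassemble_dec(opcode: int) -> str:
    """Inverse mapping: one byte back to its mnemonic form."""
    if opcode & 0b11000111 != 0b00000101:
        raise ValueError("not a DEC r opcode")
    code = (opcode >> 3) & 0b111
    for name, value in REGISTER_CODES.items():
        if value == code:
            return f"DEC {name}"
    raise ValueError("register code not in this table")


dec_b = assemble_dec("B")
print(f"{dec_b:08b}", disassemble_dec(dec_b))
```

Because each mnemonic corresponds to exactly one bit pattern in this table, assembling and disassembling are inverse operations.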

Example

The MIPS architecture provides a specific example for a machine code whose instructions are always 32 bits long.[5]: 299 The general type of instruction is given by the op (operation) field, the highest 6 bits. J-type (jump) and I-type (immediate) instructions are fully specified by op. R-type (register) instructions include an additional field funct to determine the exact operation. The fields used in these types are:

    6     5     5     5     5     6      bits
[  op  |  rs |  rt |  rd |shamt| funct]  R-type
[  op  |  rs |  rt | address/immediate]  I-type
[  op  |        target address        ]  J-type

rs, rt, and rd indicate register operands; shamt gives a shift amount; and the address or immediate fields contain an operand directly.[5]: 299–301

For example, adding the registers 1 and 2 and placing the result in register 6 is encoded:[5]: 554 

[  op  |  rs |  rt |  rd |shamt| funct]
    0     1     2     6     0     32     decimal
 000000 00001 00010 00110 00000 100000   binary
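
The packing of the six R-type fields into one word can be reproduced with shifts and ORs. This is a minimal sketch using the field widths given above; the function name `encode_r_type` is hypothetical.

```python
def encode_r_type(op, rs, rt, rd, shamt, funct):
    """Pack the six R-type fields (6, 5, 5, 5, 5, 6 bits) into one 32-bit word."""
    fields = [(op, 6), (rs, 5), (rt, 5), (rd, 5), (shamt, 5), (funct, 6)]
    word = 0
    for value, width in fields:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{value} does not fit in {width} bits")
        word = (word << width) | value       # append this field to the right of the previous ones
    return word


# Field values from the addition example in the text.
add_word = encode_r_type(op=0, rs=1, rt=2, rd=6, shamt=0, funct=32)
print(f"{add_word:032b}")
print(f"{add_word:#010x}")
```

Printing the word with 32 binary digits gives the same bit string as the table above with the spaces removed.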

Loading a value into register 8, taken from the memory cell 68 cells after the location listed in register 3:[5]: 552

[  op  |  rs |  rt | address/immediate]
   35     3     8           68           decimal
 100011 00011 01000 00000 00001 000100   binary

Jumping to the address 1024:[5]: 552 

[  op  |        target address        ]
    2                 1024               decimal
 000010 00000 00000 00000 10000 000000   binary
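
Going the other way, a word can be split back into its fields once the op field has selected the layout. The sketch below makes simplifying assumptions that are not in the text: op 0 is treated as R-type, op 2 and op 3 as J-type (op 3 is an addition for illustration), and every other op value as I-type. The function name `decode_mips` is hypothetical, and the target field is reported exactly as stored; how hardware turns it into an address is outside this sketch.

```python
def decode_mips(word: int) -> dict:
    """Split a 32-bit word into fields, choosing the layout from the op field."""
    op = word >> 26
    if op == 0:                               # R-type: funct selects the operation
        return {"type": "R", "op": op,
                "rs": (word >> 21) & 0x1F, "rt": (word >> 16) & 0x1F,
                "rd": (word >> 11) & 0x1F, "shamt": (word >> 6) & 0x1F,
                "funct": word & 0x3F}
    if op in (2, 3):                          # J-type: 26-bit target field
        return {"type": "J", "op": op, "target": word & 0x3FFFFFF}
    return {"type": "I", "op": op,            # everything else is treated as I-type here
            "rs": (word >> 21) & 0x1F, "rt": (word >> 16) & 0x1F,
            "immediate": word & 0xFFFF}


# The three binary encodings from the examples above.
examples = [
    "000000 00001 00010 00110 00000 100000",
    "100011 00011 01000 00000 00001 000100",
    "000010 00000 00000 00000 10000 000000",
]
decoded = [decode_mips(int(bits.replace(" ", ""), 2)) for bits in examples]
for fields in decoded:
    print(fields)
```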

Overlapping instructions

On processor architectures with variable-length instruction sets[6] (such as Intel's x86 processor family) it is, within the limits of the control-flow resynchronizing phenomenon known as the Kruskal count,[7][6][8][9][10] sometimes possible through opcode-level programming to deliberately arrange the resulting code so that two code paths share a common fragment of opcode sequences.[nb 2] These are called overlapping instructions, overlapping opcodes, overlapping code, overlapped code, instruction scission, or jump into the middle of an instruction, and represent a form of superposition.[11][12][13]
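
The effect can be illustrated by decoding one byte string from two different starting offsets. The decoder below knows only three x86 opcodes and assumes 32/64-bit operand sizes; the byte string and the function name `trace` are made up for the example. Entering at offset 1 lands in the middle of the five-byte MOV and yields a different instruction sequence, yet both decodings meet again at the same later offset.

```python
def trace(code: bytes, start: int) -> list[tuple[int, str]]:
    """Linear decode from `start`, returning (offset, mnemonic) pairs."""
    out = []
    i = start
    while i < len(code):
        if code[i] == 0x90:                   # NOP, 1 byte
            out.append((i, "nop"))
            i += 1
        elif code[i] == 0xB8:                 # MOV EAX, imm32, 5 bytes
            imm = int.from_bytes(code[i + 1:i + 5], "little")
            out.append((i, f"mov eax, {imm:#x}"))
            i += 5
        elif code[i:i + 2] == b"\x0f\x05":    # SYSCALL, 2 bytes
            out.append((i, "syscall"))
            i += 2
        else:
            raise ValueError(f"byte {code[i]:#04x} not in this toy table")
    return out


code = bytes.fromhex("b8 0f 05 90 90 90")
from_0 = trace(code, 0)    # intended entry point
from_1 = trace(code, 1)    # entry into the middle of the MOV instruction
print(from_0)
print(from_1)
```

The SYSCALL found at offset 1 exists only as part of the MOV instruction's constant when the code is read from offset 0.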

In the 1970s and 1980s, overlapping instructions were sometimes used to preserve memory space. One example was the implementation of error tables in Microsoft's Altair BASIC, where interleaved instructions mutually shared their instruction bytes.[14][6][11] The technique is rarely used today, but may still be needed where extreme byte-level size optimization is required, such as in the implementation of boot loaders, which have to fit into boot sectors.[nb 3]

It is also sometimes used as a code obfuscation technique as a measure against disassembly and tampering.[6][9]

The principle is also utilized in shared code sequences of fat binaries which must run on multiple instruction-set-incompatible processor platforms.[nb 2]

This property is also used to find unintended instructions called gadgets in existing code repositories and is utilized in return-oriented programming as an alternative to code injection for exploits such as return-to-libc attacks.[15][6]

Relationship to microcode

In some computers, the machine code of the architecture is implemented by an even more fundamental underlying layer called microcode, providing a common machine language interface across a line or family of different models of computer with widely different underlying dataflows. This is done to facilitate porting of machine language programs between different models. An example of this use is the IBM System/360 family of computers and their successors. With dataflow path widths of 8 bits to 64 bits and beyond, they nevertheless present a common architecture at the machine language level across the entire line.

Using microcode to implement an emulator enables the computer to present the architecture of an entirely different computer. The System/360 line used this to allow porting programs from earlier IBM machines to the new family of computers, e.g. an IBM 1401/1440/1460 emulator on the IBM S/360 model 40.

Relationship to bytecode

Machine code is generally different from bytecode (also known as p-code), which is either executed by an interpreter or itself compiled into machine code for faster (direct) execution. An exception is when a processor is designed to use a particular bytecode directly as its machine code, such as is the case with Java processors.
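
As one concrete illustration of bytecode, and not one taken from the text, the CPython interpreter compiles functions to its own bytecode, which the standard-library `dis` module can list. The sketch assumes CPython; the instruction names and raw bytes differ between interpreter versions, so no particular output is shown.

```python
import dis


def add(a, b):
    return a + b


# These are instructions for the CPython virtual machine, not for the host CPU.
instructions = list(dis.get_instructions(add))
opnames = [ins.opname for ins in instructions]
print(opnames)                        # exact names vary between CPython versions
print(add.__code__.co_code.hex())     # the raw bytecode bytes of the function
```

The same function yields the same bytecode on any machine running the same interpreter version, whereas its machine code would depend on the processor.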

Machine code and assembly code are sometimes called native code when referring to platform-dependent parts of language features or libraries.[16]

Storing in memory

From the point of view of the CPU, machine code is stored in RAM, but is typically also kept in a set of caches for performance reasons. There may be different caches for instructions and data, depending on the architecture.

The CPU knows what machine code to execute based on its internal program counter. The program counter points to a memory address and is changed based on special instructions which may cause programmatic branches. The program counter is typically set to a hard-coded value when the CPU is first powered on, so the CPU will execute whatever machine code happens to be at this address.

Similarly, the program counter can be set to execute whatever machine code is at some arbitrary address, even if this is not valid machine code. This will typically trigger an architecture-specific protection fault.

In a paging-based system, the CPU is often told by an execute bit in the page permissions whether the current page actually holds machine code; pages have multiple such permission bits (readable, writable, etc.) for various housekeeping functionality. For example, on Unix-like systems memory pages can be toggled to be executable with the mprotect() system call, and on Windows, VirtualProtect() can be used to achieve a similar result. If an attempt is made to execute machine code on a non-executable page, an architecture-specific fault will typically occur. Treating data as machine code, or finding new ways to use existing machine code, by various techniques, is the basis of some security vulnerabilities.
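
The permission check can be modelled in a few lines. The following is a toy model only: it does not call mprotect() or VirtualProtect(), and the class names, the 4096-byte page size and the `protect` method are hypothetical stand-ins chosen for the illustration.

```python
from enum import Flag, auto


class Perm(Flag):
    READ = auto()
    WRITE = auto()
    EXEC = auto()


class ProtectionFault(Exception):
    """Stands in for the architecture-specific fault mentioned in the text."""


PAGE_SIZE = 4096   # assumed page size for this model


class ToyMemory:
    def __init__(self):
        self.perms = {}                        # page number -> Perm flags

    def map_page(self, page: int, perms: Perm) -> None:
        self.perms[page] = perms

    def protect(self, page: int, perms: Perm) -> None:
        """Model of an mprotect()-style permission change on one page."""
        if page not in self.perms:
            raise KeyError(f"page {page} is not mapped")
        self.perms[page] = perms

    def fetch_instruction(self, address: int) -> int:
        """Check the execute bit before allowing an instruction fetch."""
        page = address // PAGE_SIZE
        if Perm.EXEC not in self.perms.get(page, Perm(0)):
            raise ProtectionFault(f"execute denied at {address:#x}")
        return address                         # a real CPU would return the instruction bytes


mem = ToyMemory()
mem.map_page(1, Perm.READ | Perm.WRITE)        # data page: no execute bit
try:
    mem.fetch_instruction(0x1000)
    faulted = False
except ProtectionFault:
    faulted = True
mem.protect(1, Perm.READ | Perm.EXEC)          # now executable, no longer writable
fetched = mem.fetch_instruction(0x1000)
print(faulted, hex(fetched))
```

In this model the second call drops the write permission when adding execute permission, so the page is never writable and executable at the same time; this is a design choice of the sketch, not something the text requires.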

Similarly, in a segment-based system, segment descriptors can indicate whether a segment can contain executable code and in what rings that code can run.

From the point of view of a process, the code space is the part of its address space where the code in execution is stored. In multitasking systems this comprises the program's code segment and usually shared libraries. In a multi-threading environment, different threads of one process share code space along with data space, which reduces the overhead of context switching considerably as compared to process switching.

Readability by humans

Various tools and methods exist to decode machine code back to its corresponding source code.

Machine code can easily be decoded back to its corresponding assembly language source code because assembly language forms a one-to-one mapping to machine code.[17] The assembly language decoding method is called disassembly.

Machine code may be decoded back to its corresponding high-level language under two conditions:

The first condition is to accept an obfuscated reading of the source code. An obfuscated version of source code is displayed if the machine code is sent to a decompiler of the source language.

The second condition requires the machine code to have information about the source code encoded within. The information includes a symbol table that contains debug symbols. The symbol table may be stored within the executable, or it may exist in separate files. A debugger can then read the symbol table to help the programmer interactively debug the machine code in execution.

Notes

  1. Such as many versions of BASIC, especially early ones, as well as Smalltalk, MATLAB, Perl, Python, Ruby and other special purpose or scripting languages.
  2. While overlapping instructions on processor architectures with variable-length instruction sets can sometimes be arranged to merge different code paths back into one through control-flow resynchronization, overlapping code for different processor architectures can sometimes also be crafted to cause execution paths to branch into different directions depending on the underlying processor, as is sometimes utilized in fat binaries.
  3. As an example, the DR-DOS MBRs and boot sectors (which also hold the partition table and BIOS Parameter Block, leaving less than 446 and 423 bytes, respectively, for the code) were traditionally able to locate the boot file in the FAT12 or FAT16 file system by themselves and load it into memory as a whole, in contrast to their counterparts in MS-DOS/PC DOS, which instead relied on the system files to occupy the first two directory entries in the file system and the first three sectors of IBMBIO.COM to be stored at the start of the data area in contiguous sectors containing a secondary loader to load the remainder of the file into memory (requiring SYS to take care of all these conditions). When FAT32 and LBA support was added, Microsoft even switched to require 386 instructions and split the boot code over two sectors for code size reasons, which was not an option for DR-DOS as it would have broken backward- and cross-compatibility with other operating systems in multi-boot and chain load scenarios, as well as with older PCs. Instead, the DR-DOS 7.07 boot sectors resorted to self-modifying code, opcode-level programming in machine language, controlled utilization of (documented) side effects, multi-level data/code overlapping and algorithmic folding techniques to still fit everything into a physical sector of only 512 bytes without giving up any of their extended functionality.

References

  1. Stallings, William. Computer Organization and Architecture 10th edition. p. 776. ISBN 9789332570405.
  2. Kjell, Bradley. "Immediate Operand".
  3. Dourish, Paul (2004). Where the Action is: The Foundations of Embodied Interaction. MIT Press. p. 7. ISBN 0-262-54178-5. Retrieved 2023-03-05.
  4. Zaks, Rodnay (1982). Programming the Z80 (Third Revised ed.). Sybex. pp. 67, 120, 609. ISBN 0-89588-094-6. Retrieved 2023-03-05.
  5. Harris, David; Harris, Sarah L. (2007). Digital Design and Computer Architecture. Morgan Kaufmann Publishers. ISBN 978-0-12-370497-9. Retrieved 2023-03-05.
  6. Jacob, Matthias; Jakubowski, Mariusz H.; Venkatesan, Ramarathnam (20–21 September 2007). Towards Integral Binary Execution: Implementing Oblivious Hashing Using Overlapped Instruction Encodings (PDF). Proceedings of the 9th workshop on Multimedia & Security (MM&Sec '07). Dallas, Texas, US: Association for Computing Machinery. pp. 129–140. CiteSeerX 10.1.1.69.5258. doi:10.1145/1288869.1288887. ISBN 978-1-59593-857-2. S2CID 14174680. Archived (PDF) from the original on 2018-09-04. Retrieved 2021-12-25. (12 pages)
  7. Lagarias, Jeffrey "Jeff" Clark; Rains, Eric Michael; Vanderbei, Robert J. (2009) [2001-10-13]. Brams, Stephen; Gehrlein, William V.; Roberts, Fred S. (eds.). "The Kruskal Count". The Mathematics of Preference, Choice and Order. Essays in Honor of Peter J. Fishburn. Berlin / Heidelberg, Germany: Springer-Verlag: 371–391. arXiv:math/0110143. ISBN 978-3-540-79127-0. (22 pages)
  8. Andriesse, Dennis; Bos, Herbert (2014-07-10). Written at Vrije Universiteit Amsterdam, Amsterdam, Netherlands. Dietrich, Sven (ed.). Instruction-Level Steganography for Covert Trigger-Based Malware (PDF). 11th International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA). Lecture Notes in Computer Science. Egham, UK; Switzerland: Springer International Publishing. pp. 41–50 [45]. doi:10.1007/978-3-319-08509-8_3. eISSN 1611-3349. ISBN 978-3-31908508-1. ISSN 0302-9743. S2CID 4634611. LNCS 8550. Archived (PDF) from the original on 2023-08-26. Retrieved 2023-08-26. (10 pages)
  9. Jakubowski, Mariusz H. (February 2016). "Graph Based Model for Software Tamper Protection". Microsoft. Archived from the original on 2019-10-31. Retrieved 2023-08-19.
  10. Jämthagen, Christopher (November 2016). "On Offensive and Defensive Methods in Software Security" (PDF) (Thesis). Lund, Sweden: Department of Electrical and Information Technology, Lund University. p. 96. ISBN 978-91-7623-942-1. ISSN 1654-790X. Archived (PDF) from the original on 2023-08-26. Retrieved 2023-08-26. (1+xvii+1+152 pages)
  11. "Unintended Instructions on x86". Hacker News. 2021. Archived from the original on 2021-12-25. Retrieved 2021-12-24.
  12. Kinder, Johannes (2010-09-24). Static Analysis of x86 Executables [Statische Analyse von Programmen in x86 Maschinensprache] (PDF) (Dissertation). Munich, Germany: Technische Universität Darmstadt. D17. Archived from the original on 2020-11-12. Retrieved 2021-12-25. (199 pages)
  13. "What is "overlapping instructions" obfuscation?". Reverse Engineering Stack Exchange. 2013-04-07. Archived from the original on 2021-12-25. Retrieved 2021-12-25.
  14. Gates, William "Bill" Henry, Personal communication (NB. According to Jacob et al.)
  15. Shacham, Hovav (2007). The Geometry of Innocent Flesh on the Bone: Return-into-libc without Function Calls (on the x86) (PDF). Proceedings of the ACM, CCS 2007. ACM Press. Archived (PDF) from the original on 2021-12-15. Retrieved 2021-12-24.
  16. "Managed, Unmanaged, Native: What Kind of Code Is This?". developer.com. 2003-04-28. Retrieved 2008-09-02.
  17. Tanenbaum, Andrew S. (1990). Structured Computer Organization, Third Edition. Prentice Hall. p. 398. ISBN 978-0-13-854662-5.
  18. "Associated Data Architecture". High Level Assembler and Toolkit Feature.
  19. "COBOL SYSADATA file contents". Enterprise COBOL for z/OS.
  20. "SYSADATA message information". Enterprise PL/I for z/OS 6.1 information.
  21. "Symbols for Windows debugging". Microsoft Learn.
  22. "Querying the .Pdb File". Microsoft Learn.

Further reading