Hacker News

It's helpful to realize x86 assembly is not what's executed by the machine; machine code is. One assembly instruction, e.g. ADDL, is translated to several different machine code instructions depending on the destination, source, and addressing mode.


Can you point to a source for this? All x86 assemblers that I know of map one assembly instruction to one machine instruction.


I'm looking at the Microsoft Macro Assembler 5.1 Reference manual. It was nearby and easily accessible to me; yes, it's old (from the very late 80s or early 90s), but it covers the 32-bit 80386, which is still valid.

Anyway, it shows three different encodings for the ADD instruction. The first:

    000000dw mod,reg,r/m
This adds register to register, or memory to register (either direction, the d above) using either 8 or 16/32 bits (the w above [1]). The second form:

    100000sw mod,000,r/m
This adds an immediate value (8 or 16/32 bits, w again) to a register or memory operand. The s bit is used to sign extend the data (s=1; otherwise, 0-extend it) if required [2]. The final form:

    0000010w data
This adds an immediate value to the accumulator register (EAX, AX, AL) [1]. That's three different encodings for the "same" instruction. The MOV instruction (and again, I'm only talking about the 80386 here) has 8 different encodings, depending upon registers used.

[1] If the current code segment is designated as a 16-bit segment, then the w means 16 bits, unless a size override byte (an opcode prefix byte) is present, in which case it means 32 bits. If the current code segment is designated as a 32-bit segment, then the w means 32 bits, again unless a size override byte is present, in which case it means 16 bits.

[2] It seems to me that if w=1, then the s bit is extraneous and thus could be used to encode other instructions. I'm not sure if that is the case but it's common to use otherwise nonsensical instruction encoding to do something useful.
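
To make the three forms concrete, here is a rough Python sketch that builds the bytes for each one. It assumes a 32-bit code segment with no size override (so w=1 means 32 bits) and register-direct operands only (mod=11). The helper names (`modrm`, `add_reg_reg`, `add_reg_imm`, `add_eax_imm`) are made up for illustration and are not taken from the MASM manual.

```python
# Register numbers as used in the reg and r/m fields (32-bit registers).
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def modrm(mod, reg, rm):
    """Pack the mod (2 bits), reg (3 bits) and r/m (3 bits) fields into one byte."""
    return (mod << 6) | (reg << 3) | rm

def add_reg_reg(dst, src, d=0):
    """Form 1: 000000dw mod,reg,r/m with w=1 and mod=11."""
    opcode = 0b00000000 | (d << 1) | 1
    if d == 0:
        # reg field holds the source, r/m holds the destination
        return bytes([opcode, modrm(0b11, REG32[src], REG32[dst])])
    # reg field holds the destination, r/m holds the source
    return bytes([opcode, modrm(0b11, REG32[dst], REG32[src])])

def add_reg_imm(dst, imm):
    """Form 2: 100000sw mod,000,r/m followed by the immediate, with w=1."""
    if -128 <= imm <= 127:
        # s=1: one immediate byte, sign-extended by the CPU
        return bytes([0b10000011, modrm(0b11, 0, REG32[dst])]) + \
            imm.to_bytes(1, "little", signed=True)
    # s=0: full 32-bit immediate
    return bytes([0b10000001, modrm(0b11, 0, REG32[dst])]) + \
        (imm & 0xFFFFFFFF).to_bytes(4, "little")

def add_eax_imm(imm):
    """Form 3: 0000010w data, with w=1 (accumulator is EAX)."""
    return bytes([0b00000101]) + (imm & 0xFFFFFFFF).to_bytes(4, "little")

print(add_reg_reg("ebx", "ecx").hex())       # add ebx, ecx
print(add_reg_imm("ebx", 5).hex())           # add ebx, 5
print(add_eax_imm(0x12345678).hex())         # add eax, 12345678h
```

Each call produces exactly one machine instruction; which opcode byte comes out depends on the operands, and for register-to-register adds the d bit gives two equally valid choices.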


> It seems to me that if w=1, then the s bit is extraneous and thus could be used to encode other instructions. I'm not sure if that is the case but it's common to use otherwise nonsensical instruction encoding to do something useful.

Opcode 82h is an alias for 80h --- it presumably sign-extends the immediate value into an internal temporary register, but the upper bits don't matter anyway since it's an 8-bit add. Some interesting discussion on that here, along with an example application:

http://computer-programming-forum.com/46-asm/143edbd28ae1a09...


Here are the opcodes for the x86 ADD assembler instruction:

http://www.mathemainzel.info/files/x86asmref.html#add

The link shows nine ways to use the ADD instruction with each method resulting in a different opcode.


There's a bit of a miscommunication going on. What I meant was, when you write an assembly instruction, that maps to one machine instruction.

I.e., because of things like addressing modes, different invocations of an ADD instruction can map to different machine instructions. But one ADD invocation will always map to one machine instruction.

The parent comment sounded to me like one assembly instruction could map to several machine instructions, like one line of C is equivalent to several lines of assembly. Just wanted to clarify that that isn't the case.


> There's a bit of a miscommunication going on. What I meant was, when you write an assembly instruction, that maps to one machine instruction.

That's not true on x86-16, x86-32 and x86-64. For example

  060o, 310o
and

  062o, 301o
(...o means "octal"; for the reason why I give this example in octal instead of hexadecimal, cf. https://news.ycombinator.com/item?id=13051770) both stand for "xor al, cl" (the assembler you use will pick one of the two encodings). For those people who really prefer hexadecimal, it corresponds to

  30h, C8h
and

  32h, C1h
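
For anyone who wants to check the two byte pairs above, here is a small Python sketch that decodes them by hand. It only handles the 8-bit register-to-register forms of XOR (opcodes 30h and 32h with mod=11), and `decode_xor8` is a hypothetical helper name, not part of any real disassembler.

```python
# 8-bit register names in encoding order (register number 0..7).
REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]

def decode_xor8(code):
    """Decode a two-byte 8-bit register-to-register XOR into assembly text."""
    opcode, modrm = code
    if opcode not in (0x30, 0x32):
        raise ValueError("only the 8-bit register forms of XOR are handled")
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
    if mod != 0b11:
        raise ValueError("only register-direct operands are handled")
    d = (opcode >> 1) & 1
    # d=0: r/m is the destination and reg the source; d=1: the other way round
    dst, src = (reg, rm) if d else (rm, reg)
    return f"xor {REG8[dst]}, {REG8[src]}"

print(decode_xor8(bytes([0o060, 0o310])))  # first encoding, written in octal
print(decode_xor8(bytes([0x32, 0xC1])))    # second encoding, written in hex
```

Both calls give the same text, because the d bit and the swapped reg and r/m fields cancel each other out.
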
The fact that there are different ways to encode some instructions was used by the A86 assembler (https://en.wikipedia.org/w/index.php?title=A86_(software)&ol...) to watermark machine code that was generated by it; in particular to detect whether it was generated by a registered or unregistered version of A86:

"The assembler automatically embeds a "fingerprint" into the generated code through a particular choice of functionally equivalent instruction encodings. This makes it possible to tell if code was assembled with A86, and also to distinguish between registered and unregistered versions of the assembler, although access to the source code is required."


Re-read my previous comment. An assembly instruction can map to one of a set of machine instructions, but only one.

Said another way, when you write:

MOV eax, 5

This will map to _either_:

110111 _or_ 110110, but _not_ both in sequence.


I read it as one assembly instruction can map to one of several machine instructions.


That is also true --- for example, "mov reg, reg" is a special case of "mov reg, r/m" or "mov r/m, reg" with the r/m specifying a register, so basically two separate sequences of bytes which perform the same operation. This has been exploited by copy-protection and steganography, going back to the A86 shareware assembler which was the first use of this technique that I can remember, to more recent developments:

https://www.cs.columbia.edu/~angelos/Papers/hydan.pdf

http://stackoverflow.com/questions/17973103/why-does-the-sol...

(Almost wish that last link was cut off one letter earlier...)
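
As a toy illustration of how the redundant "mov reg, reg" encodings can carry hidden information, here is a minimal Python sketch that stores one bit per instruction in the choice of opcode. This is my own simplification, not the actual scheme used by A86 or Hydan, and the names `encode_mov` and `recover_bit` are invented for the example.

```python
# 32-bit register names in encoding order (register number 0..7).
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def encode_mov(dst, src, bit):
    """Encode 'mov dst, src', picking one of two equivalent forms by the hidden bit."""
    d, s = REG32.index(dst), REG32.index(src)
    if bit == 0:
        # 89h = mov r/m32, r32: reg field is the source, r/m the destination
        return bytes([0x89, 0xC0 | (s << 3) | d])
    # 8Bh = mov r32, r/m32: reg field is the destination, r/m the source
    return bytes([0x8B, 0xC0 | (d << 3) | s])

def recover_bit(code):
    """Read the hidden bit back from the opcode byte."""
    return {0x89: 0, 0x8B: 1}[code[0]]

for bit in (0, 1):
    code = encode_mov("eax", "ebx", bit)
    print(code.hex(), recover_bit(code))
```

Both outputs execute as "mov eax, ebx", so the program behaves the same while the opcode byte carries the extra bit.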



>Can you point to a source for this? All x86 assemblers that I know of map one assembly instruction to one machine instruction.

Seeing the confusion and clarifications in the replies to your comment, I think it may have been clearer if you had said (and you probably meant):

"Can you point to a source for this? All x86 assemblers that I know of map one assembly instruction to one out of a set of machine instructions, where the chosen machine instruction depends on things like the addressing mode (immediate, indexed, indirect indexed, etc.)." I'm using older terms for addressing modes and am not sure they are still valid for newer processors, but the concept is the same.


I suspect the instructions they are thinking about are cpu-level uops.


There would be much less confusion about this in the descendant posts if this said "One assembly mnemonic is translated to ..."



