This is how:

Get rid of strlen(), scanf(), etc. (assuming that the string ends in 0x0a00)

1. PCMPEQB 0x00, PCMPEQD 0x00 + loop while MOVMSKPS!=0

2. PCMPEQB 0x0a, PCMPEQD 0x00 + loop while MOVMSKPS!=0

Do the entire conversion in SSE registers:

1. Load 16 bytes using MOVUPS, save to 'Y', then PSUBUSB 0x30 from all bytes ('0'-'9' will be #0-#9 and 0x0a will be also be #0). Call this value 'X'.

2. Use MOVAPSs and ANDPSs on 'X' until you end up with one digit in each dword of 4 different SSE registers and convert to float using CVTDQ2PS.

3. Multiply these 4 registers by 4 different vectors:

Code:

[10^0, 10^1/256, 10^2/256^2, 10^3/256^3]
[10^4, 10^5/256, 10^6/256^2, 10^7/256^3]
[10^8, 10^9/256, 10^10/256^2,10^11/256^3]
[10^12,10^13/256,10^14/256^2,10^15/256^3]

4. Accumulate results but make sure that values are less than 10^7 apart in magnitude (single floats have 24 bits in the mantissa, which is a little more than than 10^7). Do this using ADDPS & HADDPS, replacing a couple of above MOVs with PSHUFs, and adjusting the 4 factors accordingly.

5. Convert the resulting 4 floats to doubles (CVTPS2PD) and accumulate using HADDPD twice.

6. Independently (so this doesn't hurt at all!) from doing lots of work on 'X', we do magic on 'Y' to locate the index of the highest '0x0a', e.g. with PCMPBEQ '0x0a', followed by e.g. 2-4 bitscans in GPRs. Call the result 'Z'.

7. Use 'Z' to index a 15-entry table containing 10^-1, 10^-2, 10^-3...10^-16, and do a corrective multiply followed by a conversion to int (there is no instructions to do this, so you have to do this with PSUBD, 2xPSLLRQ, ANDPS, ORPS). Since the maximum result only has 48-bits the 53-bit mantissa is sufficient.

8. Add 'Z' to pointer...done. Next value please!

The Dependency chain

sub-and-convert-mul-add-add-convert-add-add-mul-convert

is ~35 cycles. Depending on how fast the address of the next value is determined (loop dependency) and therefore how much code can therefore possibly be done OOO, the optimal case is probably ~10 cycles and will be really tough to achieve, but doing all this in 20 cycles shouldn't require any additional tricks. I think I'd need about 16 hours of programming to do it.

Have fun