Basically I'm having a hard time getting the execution time any lower than it is, as well as reducing the number of clock cycles and the memory size. Does anyone have any idea how I can do this? The code works fine, I just want to change it a bit.
I wrote working code, but I don't want to mess it up, and I don't know what changes to make.
; Calculation of a factorial value using a simple loop

; set up the exception addresses
        THUMB
        AREA    RESET, CODE, READONLY
        EXPORT  __Vectors
        EXPORT  Reset_Handler
__Vectors
        DCD     0x00180000      ; top of the stack
        DCD     Reset_Handler   ; reset vector - where the program starts

        AREA    2a_Code, CODE, READONLY
Reset_Handler
        ENTRY
start
        MOV     r1,#0           ; count the number of multiplications performed
        MOV     r2,#3           ; the final value in the factorial calculation
        MOV     r3,#1           ; the factorial result will be stored here

; loop r2 times forming the product
fact
        ADD     r1,r1,#1        ; find the next multiplicand
        MUL     r3,r1,r3        ; form the next product - note that MUL r3,r3,r1 gives unpredictable output
        CMP     r1,r2           ; check if the final value has been reached
        BMI     fact            ; continue if all products have not been formed
exit
        B       exit            ; stay in an endless loop
        END

The current results are:

Memory Size: 0x00000024
Clock Cycles: 22
Total Execution Time: 1.1 microseconds
We are working with the Cortex M3
I just need any of these to be reduced, the changes to the code can be minor as long as it produces different results.
Often code-size and performance are a tradeoff. Unrolling a loop often helps performance (for large inputs at least), but requires extra logic outside the loop to handle the cleanup and so on.
Most of this answer was assuming a higher-performance CPU like Cortex-A9 or Cortex-A53 where software pipelining to create instruction-level parallelism would be helpful. Cortex M3 is scalar and has a single-cycle multiply instruction, making it much simpler to optimize for.(The original question didn't specify a core, and I was expecting that even low-end CPUs would have multi-cycle mul latency. I only found Cortex-M3 numbers after writing it.)
Your code will probably bottleneck on the latency of integer multiply. Unlike add, where the result will be ready the next cycle, mul is complex and takes multiple cycles to produce a result.
(Except on some very slowly-clocked chips, like apparently Cortex-M3 has a 1-cycle mul instruction. But Cortex-M0/M0+/M23 are available with a choice of 1 cycle or 32 cycle performance for that instruction! Slow iterative = smaller silicon.)
The multiply execution unit itself is often pipelined so multiple independent multiplies can be in flight at once, but your factorial loop needs each multiply result as an input to the next iteration. (Only for higher-performance cores, not Cortex-M series. The 32-cycle multiply on slow cortex-M chips is iterative and presumably not pipelined, so another multiply couldn't start while it's running, and there'd be no benefit to exposing any instruction-level parallelism beyond reducing loop overhead.)
Notice that multiplication is associative: 1 * 2 * 3 = 3 * 2 * 1, so we can count down from n, as @ensc's answer points out. Or (1*2) * (3*4) = 1*2*3*4.
We could instead do 1 * 2 * ... * (n/2) in parallel with n/2+1 * n/2+2 * n/2+3 * ... * n, interleaving work on those two dependency chains. Or we could interleave 1 * 3 * 5 * ... * n with 2 * 4 * 6 * ... n-1, in a loop that did n -= 2 and calculates n+1 from that. (Then at the end, you multiply those 2 products).
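As a sketch of that two-chain idea in C (my own illustration, not code from the answer; `fact_two_chains` is a made-up name): keep the product of the even terms and the product of the odd terms as independent dependency chains, and combine them once at the end. On a pipelined multiplier the two `*=` chains can overlap in flight.

```c
#include <assert.h>
#include <stdint.h>

// Two independent dependency chains: hiprod takes n, n-2, n-4, ...
// and loprod takes n-1, n-3, ...  They only meet in the final multiply.
static uint32_t fact_two_chains(uint32_t n)
{
    uint32_t hiprod = 1, loprod = 1;
    for (uint32_t i = n; i > 1; i -= 2) {
        hiprod *= i;            // even-position chain
        if (i > 2)
            loprod *= i - 1;    // odd-position chain (skip the *1 at the end)
    }
    return hiprod * loprod;     // combine the chains once at the end
}
```

This only checks the arithmetic of the splitting trick; the actual speedup depends on the core having a pipelined multiplier, as discussed above.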
This is obviously going to require more code-size, but could help performance a lot.
Of course, a lookup table is another workaround. If you only care about inputs that don't overflow a 32-bit result, that's a pretty small table. But that has a significant size cost.
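A minimal sketch of the table approach (my own C; `fact_lut` is a made-up name): 13! = 6227020800 overflows 32 bits, so only 0! through 12! fit and the whole table is 13 words = 52 bytes.

```c
#include <assert.h>
#include <stdint.h>

// All factorials that fit in a uint32_t: 12! = 479001600 is the largest.
static const uint32_t fact_table[13] = {
    1u, 1u, 2u, 6u, 24u, 120u, 720u, 5040u, 40320u,
    362880u, 3628800u, 39916800u, 479001600u
};

static uint32_t fact_lut(uint32_t n)
{
    return (n < 13) ? fact_table[n] : 0;   // 0 as an overflow sentinel
}
```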
Even on an in-order CPU (where instruction execution has to start in program order), long-running instructions like cache-miss loads, or multiplies, may be allowed to complete out of order, so e.g. some add instructions could run after starting a mul but before the mul result was written back. Or even starting another independent mul instruction in the shadow of an earlier mul's latency.
I googled some ARM performance numbers to maybe get a feel for what's typical.
For example, Cortex-A9 is an older fairly common high-end ARMv7 CPU that is superscalar (multiple instructions per cycle) with out-of-order execution.
mul "takes" 2 cycles, and has 4 cycle result latency. They don't explain what they mean by the non-latency cost. Perhaps that's the reciprocal throughput of the execution unit, like how often you can start a new independent operation. It's an out-of-order CPU so it doesn't make sense for it to stall other instructions for 2 cycles. In the NEON SIMD instruction section, they explain what looks like the same "cycles" number:
This is the number of issue cycles the particular instruction consumes, and is the absolute minimum number of cycles per instruction if no operand interlocks are present.
(operand interlocks = waiting for an input operand to be ready, if an earlier instruction hasn't produced a result yet).
(Cortex-A9 does support packed integer multiplication, so for large factorials you could look at doing 4 multiplies in parallel starting one vector per 4 cycles, using vmul.32 q1, q1, q2. Or 2 per 2 cycles with 64-bit d registers, but then you'd need more vadd instructions and unlike multiply, vadd.32 is just as fast with 128-bit q regs as with 64-bit vectors. So SIMD can give you twice the multiply throughput of scalar on Cortex-A9, if you use enough registers to hide the large latency. But SIMD would probably only be useful with n so large that n! overflows a 32-bit integer, so you get a result modulo 2^32.)
Lower latency ARM multiply instructions:

mul is a 32x32 => 32-bit multiply. On Cortex-A9, it has 2c throughput and 4c latency.
(muls is a 16-bit instruction in thumb mode, and should be preferred unless you need to not clobber the flags. mul in Thumb mode is only available in ARMv6T2 and later.)
smulbb is a 16x16 => 32-bit signed multiply that only reads the low half of its inputs, but has 1c throughput and 3c latency on A9. (BB = bottom, bottom. The other combinations are also available, along with multiply-accumulate and various funky things.)
There is no 2-byte Thumb version of smulxy, so this is worse for code-size than muls.
Unfortunately smulxy isn't available in an unsigned version, so that limits the range of inputs we can use it with to positive int16_t, not uint16_t.
But if we only care about the case where the final 32-bit result doesn't overflow, we can arrange our order of operations so the last multiply has 2 inputs of similar magnitude (both large-ish 16-bit numbers). i.e. as close to sqrt(n!) as possible. So e.g. the product of odds and evens would be reasonable, but (n-1)! * n would be the worst case because that would require (n-1)! to fit in 16 bits. Actually the worst case would be counting down from n so the last one is a multiply by 3 then 2. We could special case the multiply by 2 to a left shift...
Putting these pieces together, notice that multiplying by 1 is a no-op (except with smulbb where it truncates the input to 16 bit). So we can unroll in a way that stops after a multiply by 1 or 2 depending on the input being odd or even.
So instead of knowing which is odd and which is even, we just have lo (starting with n-1) and hi (starting with n).
;; UNTESTED, but it does assemble with the GNU assembler, after
;;   sed -i 's/;/@/' arm-fact.S
;; and replacing THUMB with
;;   .thumb
;;   .syntax unified

THUMB

;; Input: n in r0.  (n is signed positive, otherwise we return n.)
;; Output: n! in r0.
;; clobbers: r1, r2, r3
;; pre-conditions: n! < 2^31.  Or maybe slightly lower.
fact:
    subs   r3, r0, #3   ; r3 = lo = n-3  (first multiplier for loprod)
    bls    .Ltiny_input
    subs   r2, r0, #2   ; r2 = hi = n-2  (first multiplier for hiprod)
    subs   r1, r0, #1   ; r1 = loprod = n-1
                        ; r0 = hiprod = n

.Lloop:                 ; do {
    smulbb r0,r0, r2    ;   hiprod *= hi
    subs   r2, #2       ;   hi -= 2 for next iter
    smulbb r1,r1, r3
    subs   r3, #2       ;   lo -= 2 for next iter
    bgt    .Lloop       ; } while((lo-=2) > 0);  signed condition
    ; r3 = 0 or -1, r2 = 1 or 0.  The last multiplies were:
    ;   hiprod *= 2 and loprod *= 1   for even n
    ;   hiprod *= 3 and loprod *= 2   for odd n

    ; muls r0, r1
    smulbb r0,r0, r1    ; return hiprod *= loprod
    bx     lr           ; or inline this

.Ltiny_input:           ; alternate return path for tiny inputs
                        ; r0 = n.  flags still set from n - 3
    IT     eq           ; GAS insists on explicit IT for thumb mode
    moveq  r0, #6       ; 3! = 6, else n! = n for smaller n=1 or 2.
                        ; 0! = 1 case is not handled, nor are negative inputs
    bx     lr

(.L in a label name makes it a local label that doesn't show up in the object file, at least in GAS syntax. Maybe not in ARMASM, if you're using that assembler.)
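Since that assembly is marked untested, here is a C model of the same hi/lo scheme as a plausibility check (my own sketch; `fact_hilo` is a made-up name, and the variables mirror the register roles: r0=hiprod, r1=loprod, r2=hi, r3=lo). It only checks the arithmetic; it does not model smulbb's truncation of its inputs to 16 bits.

```c
#include <assert.h>
#include <stdint.h>

// C model of the unrolled hi/lo loop above. Precondition: n! < 2^31.
static int32_t fact_hilo(int32_t n)
{
    int32_t lo = n - 3;            // subs r3, r0, #3
    if (lo <= 0)                   // bls .Ltiny_input
        return (n == 3) ? 6 : n;   // 3! = 6, else n! = n for n = 1 or 2
    int32_t hi = n - 2;            // r2
    int32_t loprod = n - 1;        // r1
    int32_t hiprod = n;            // r0
    do {                           // .Lloop:
        hiprod *= hi;  hi -= 2;    //   smulbb / subs pair
        loprod *= lo;              //   smulbb
    } while ((lo -= 2) > 0);       //   subs / bgt: signed "> 0"
    return hiprod * loprod;        // final smulbb: hiprod *= loprod
}
```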
ARM assembly lets you leave out the destination when it's the same as the first source, for some instructions like subs but not smulbb. You could write it out like subs r2, r2, #2 every time if you want.
You might use muls r0, r1 for the final product, because the final hiprod is a bit higher than loprod. The product might not overflow even if hiprod > max int16_t. That would save 2 bytes of code-size, too, but add 1 cycle of latency on Cortex-A9. (BTW, ARMv6 fixed the "unpredictable result" with mul d,d, src weirdness, and your code used 32-bit Thumb2 instructions, thus it only works on ARMv6T2 and above anyway.)
With 2 accumulators for the products, this can possibly run at 2 multiplies per 3 cycles on Cortex-A9, depending greatly on the CPU micro-architecture and whether its front-end can keep up. On an in-order ARM, I'd be worried about it being able to start other instructions before a multiply finished.
It might be better to spend 2 extra bytes on sub instead of subs so we can compute the flags a couple instructions ahead of the branch, maybe reducing branch mispredict penalty and avoiding stalls on in-order CPUs. smulbb doesn't touch flags, so we can do loprod first and have the hi stuff not touch flags.
.loop:                  ; do {
    smulbb r1, r3       ;   loprod *= lo
    subs   r3, #2       ;   lo -= 2 for next iter, and set flags
    smulbb r0, r2       ;   hiprod *= hi
    sub    r2, #2       ;   hi -= 2 for next iter (no flags)
    bgt    .loop        ; } while((lo-=2) > 0);

Note that we're modifying r3 and r2 right after smulbb reads them, avoiding creating a stall for the data dependency on in-order chips.
You're using Thumb mode and optimizing for code-size, so it's important to know which forms of which instructions can use a 2-byte / 16-bit encoding and which are only available as 32-bit Thumb2 encodings.
subs Rd, Rn, #imm can be encoded as a 16-bit Thumb instruction for imm=0..7 (3-bit immediate). Or with the same register as src and destination, for imm=0..255. So my copy-and-sub instructions are compact.
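As an illustration of those two 16-bit encodings (my own helper names; bit layouts per the ARMv7-M T1/T2 encodings of SUB immediate), you can reconstruct the exact opcode bytes that show up in the objdump listing of my version:

```c
#include <assert.h>
#include <stdint.h>

// T1: subs Rd, Rn, #imm3  ->  0001111 imm3(3) Rn(3) Rd(3)
static uint16_t subs_t1(unsigned rd, unsigned rn, unsigned imm3)
{
    return (uint16_t)(0x1E00 | (imm3 << 6) | (rn << 3) | rd);
}

// T2: subs Rdn, #imm8     ->  00111 Rdn(3) imm8(8)  (same reg as src and dst)
static uint16_t subs_t2(unsigned rdn, unsigned imm8)
{
    return (uint16_t)(0x3800 | (rdn << 8) | imm8);
}
```

For example, `subs_t1(3, 0, 3)` produces 0x1EC3, matching the `1ec3  subs r3, r0, #3` line in the disassembly.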
Non-flag-setting sub can't be a 16-bit instruction except inside an IT block, or with SP as the operand.
Predicated instructions in Thumb mode, like moveq r0, #6, require the assembler to use an IT instruction to introduce predication for the next up-to-4 instructions. In ARM mode, the top 4 bits of every instruction signals predication. (If you don't use a suffix, the assembler encodes it as ALways, i.e. not predicated.)
We could handle the n==0 case with another 4 or 6 bytes, with cmp r0,#0 / moveq r0, #1. Maybe getting it down to 4 bytes if we put the tst / mov inside the same IT block. IT doesn't snapshot the actual flag condition, it snapshots which predicate, so flag-setting instructions inside an IT block can have an effect on later instructions in the same block. (I think this is right, but I'm not 100% sure).
tiny_input:             ; r0 = n, flags set according to n-3
    ITET   EQ
    moveq  r0, #6
    cmpne  r0, #0
    moveq  r0, #1

Or there's 16-bit cbnz to conditionally jump over a mov r0, #1. But the branch target must be from 4 to 130 bytes after the cbnz, so we can't jump over just a single 16-bit instruction, apparently!
Code-size for my version:

$ arm-none-eabi-gcc -g -c -mcpu=cortex-a9 arm-fact.S
$ arm-none-eabi-objdump -drwC arm-fact.o

arm-fact.o:     file format elf32-littlearm

Disassembly of section .text:

00000000 <fact>:
   0:   1ec3        subs    r3, r0, #3
   2:   d90b        bls.n   1c <.tiny_input>
   4:   1e82        subs    r2, r0, #2
   6:   1e41        subs    r1, r0, #1

00000008 <.loop>:
   8:   fb10 f002   smulbb  r0, r0, r2
   c:   3a02        subs    r2, #2
   e:   fb11 f103   smulbb  r1, r1, r3
  12:   3b02        subs    r3, #2
  14:   dcf8        bgt.n   8 <.loop>
  16:   fb10 f001   smulbb  r0, r0, r1
  1a:   4770        bx      lr

0000001c <.tiny_input>:
  1c:   bf08        it      eq
  1e:   2006        moveq   r0, #6
  20:   4770        bx      lr

So it's 0x22 bytes for this function. (Or 0x26 if we want to handle 0! = 1.)
It's larger than your version (your byte count includes some constants in memory, and the mov instructions to produce input), but in theory maybe better than twice as fast for large input, on CPUs with pipelined multipliers. And maybe much faster for inputs from 1 to 3, where it just branches once and produces the result.
You probably don't have anything like a Cortex-A9, because your 1.1 microseconds = 22 clock cycles means a 20MHz clock speed, while Cortex-A9 was available in 0.8 to 2GHz.
So maybe you have a much simpler in-order core like Cortex M3? M3 does support the mul instruction, and Thumb2 mode. And wikipedia says its multiply is 1 cycle! So that's weird, I'm surprised it has that efficient a multiplier. Or just that it clocks so slowly that there's time for a lot of gate delays in 1 stage, and it's only a 3-stage pipeline.
Cortex-M3 version: subs and muls are single-cycle on Cortex-M3. I haven't found perf numbers on branches, but they're common so I'm assuming it's probably 1 cycle and doesn't cause a big fetch bubble (if correctly predicted...). The Cortex-M3 HTML manual has a section on Branch target forwarding, which appears to be about reducing the fetch bubble.
Its instruction timing table shows b<cond> costs 1 cycle for not-taken, or 2 cycles for taken. (1 for the branch, 1 for the pipeline reload after an immediate displacement.). So taken branches are slow compared to sub/mul and unrolling would be valuable, so my code above should still work well. (But multiple product accumulators are not necessary, so it can be simplified).
Optimizing for code-size:

;; UNTESTED
THUMB

;; Input: n in r0.  (n is signed positive, otherwise we return n.)
;; Output: n! in r0.
;; clobbers: r1
fact:
    subs   r1, r0, #1   ; i = n-1
    bls    .Ltiny_input ; jump if n<=1
.Lloop:                 ; do {
    muls   r0, r1       ;   prod *= i
    subs   r1, #1       ;   --i
    bgt    .Lloop       ; } while(--i > 0);  signed condition
    ; r1 = 0, r0 = n!
    ; last multiply was a redundant prod *= 1 but avoiding that would take a cmp

.Ltiny_input:           ; alternate return path for tiny inputs
                        ; 0! = 1 case is not handled, nor are negative inputs
    bx     lr           ; or inline this

I think that's the smallest we can manage. The loop has 3 instructions, and probably costs 4 cycles per iteration (1 + 1 + 2, the taken branch costing 2 cycles).
00000000 <fact>:
   0:   1e41        subs    r1, r0, #1
   2:   d902        bls.n   a <fact+0xa>
   4:   4348        muls    r0, r1
   6:   3901        subs    r1, #1
   8:   dcfc        bgt.n   4 <fact+0x4>
   a:   4770        bx      lr          # don't count this if inlining

So this is 0xa = 10 bytes, not counting the bx lr return instruction.
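The same loop can be modeled in C as a sanity check (my own sketch; `fact_small` is a made-up name). The do-while runs n-1 times for n >= 2, so by the cycle estimate above the loop body costs roughly 4*(n-1) cycles.

```c
#include <assert.h>
#include <stdint.h>

// C model of the size-optimized loop: r0 starts as n and becomes prod,
// r1 is the countdown i. 0! = 1 is not handled, matching the assembly.
static uint32_t fact_small(uint32_t n)
{
    uint32_t prod = n;            // r0
    int32_t  i = (int32_t)n - 1;  // subs r1, r0, #1
    if (i <= 0)                   // bls: n <= 1 returns n unchanged
        return n;
    do {                          // .Lloop:
        prod *= (uint32_t)i;      //   muls r0, r1
    } while (--i > 0);            //   subs r1, #1 ; bgt .Lloop
    return prod;                  // last trip was a redundant prod *= 1
}
```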
We could handle the 0! = 1 case with an IT block after the first subs, before the branch, so we can still jump to right after the loop (instead of to a separate block like my Cortex-A9 version). You could use this trick for it, too, though.
    subs   r1, r0, #1   ; i = n-1
    it     lt
    movlt  r0, #1       ; n = 1 for n<1
    bls    .Ltiny_input ; return n if n was <=1

If we needed more range for the branch, we could use itt ls / movls r0, #1, so the branch was inside the IT block (where branch instructions can use an encoding that spends more bits on displacement and none on the predicate). But it's a short range in this case, so I chose to leave r0 unmodified in the r0 == 1 case. I don't know if there are any CPUs where it's more efficient or lower latency for a predicated instruction to be a NOP instead of running, but there might be.
Without unrolling, putting a cmp in the loop to avoid the last *=1 iteration would cost us an extra cycle per iteration (4 cycles instead of 3), so only pay for itself with n=2 or maybe n=3.
Unrolling could help speed significantly for larger inputs, going from 1 mul per 3 cycles to asymptotically approaching 1 mul per 2 cycles (sub + mul + amortized loop overhead). I can't see any way to avoid an instruction like sub or mov to generate a separate input for each mul, except by hard-coding special case sequences for each n (like *2 * 4 = *8 = left shift by 3) when you could instead just hard-code the answer.