Performance comparison TI AM4378 vs TI AM6548
This comparison covers only the performance difference between the ARM cores.
Two kits are used as reference:
AM65x evaluation module (EVM) (TMDX654GPEVM)
AM437x Starter Kit (TMDXSK437X)
Tests
Dynamic frequency scaling is off.
Memory bandwidth tests:
DDR bandwidth test | Size (MB) | AM4378 (MB/s) | AM6548 (MB/s) |
---|---|---|---|
bw_mem 1M rd | 1.00 | 367.92 | 1401.79 |
bw_mem 1M rdwr | 1.00 | 290.15 | 1119.82 |
bw_mem 1M cp | 1.00 | 199.68 | 521.29 |
bw_mem 1M frd | 1.00 | 239.58 | 1372.21 |
bw_mem 1M fcp | 1.00 | 193.16 | 1051.71 |
bw_mem 1M bzero | 1.00 | 672.16 | 4128.17 |
bw_mem 1M bcopy | 1.00 | 196.73 | 1056.07 |
Memory latency tests:
Block size (MB) | AM4378 latency (ns), stride=128 | AM6548 latency (ns), stride=64 |
---|---|---|
0.00049 | 4.019 | 3.760 |
0.00098 | 4.019 | 3.759 |
0.00195 | 4.019 | 3.761 |
0.00293 | 4.019 | 3.760 |
0.00391 | 4.019 | 3.760 |
0.00586 | 4.019 | 3.760 |
0.00781 | 4.024 | 3.760 |
0.01172 | 4.020 | 3.760 |
0.01562 | 4.031 | 3.762 |
0.02344 | 11.037 | 3.762 |
0.03125 | 9.284 | 3.781 |
0.04688 | 14.722 | 7.994 |
0.06250 | 15.354 | 8.571 |
0.09375 | 16.474 | 9.613 |
0.12500 | 16.369 | 9.928 |
0.18750 | 16.664 | 10.263 |
0.25000 | 48.936 | 10.346 |
0.37500 | 77.136 | 10.436 |
0.50000 | 85.192 | 10.666 |
0.75000 | 95.469 | 32.556 |
1.00000 | 98.413 | 44.603 |
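The two latency columns above can be compared block size by block size. The following is a minimal sketch that only divides the AM4378 latency by the AM6548 latency for each listed block size; the list and function names are illustrative, and the two runs used different strides (128 vs 64), so the ratios are indicative rather than a strict like-for-like comparison.

```python
# Block sizes in MB, as listed in the latency results above.
BLOCK_MB = [
    0.00049, 0.00098, 0.00195, 0.00293, 0.00391, 0.00586, 0.00781,
    0.01172, 0.01562, 0.02344, 0.03125, 0.04688, 0.06250, 0.09375,
    0.12500, 0.18750, 0.25000, 0.37500, 0.50000, 0.75000, 1.00000,
]

# Measured latencies in ns (AM4378 run used stride=128).
AM4378_NS = [
    4.019, 4.019, 4.019, 4.019, 4.019, 4.019, 4.024,
    4.020, 4.031, 11.037, 9.284, 14.722, 15.354, 16.474,
    16.369, 16.664, 48.936, 77.136, 85.192, 95.469, 98.413,
]

# Measured latencies in ns (AM6548 run used stride=64).
AM6548_NS = [
    3.760, 3.759, 3.761, 3.760, 3.760, 3.760, 3.760,
    3.760, 3.762, 3.762, 3.781, 7.994, 8.571, 9.613,
    9.928, 10.263, 10.346, 10.436, 10.666, 32.556, 44.603,
]


def latency_ratios(slow_ns, fast_ns):
    """Return slow/fast latency ratio for each pair of measurements."""
    return [slow / fast for slow, fast in zip(slow_ns, fast_ns)]


lat_ratios = latency_ratios(AM4378_NS, AM6548_NS)
for size_mb, ratio in zip(BLOCK_MB, lat_ratios):
    print(f"{size_mb:8.5f} MB  AM4378/AM6548 latency = {ratio:5.2f}x")
```

With these numbers the ratio stays close to 1 for the smallest blocks, is largest at the 0.5 MB block, and is about 2.2 at 1 MB.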
Whetstone:
Whetstone output | AM4378 | AM6548 |
---|---|---|
Execution time | approx. 10 seconds | approx. 10 seconds |
Loops | 100000 | 100000 |
Iterations | 1 | 1 |
Duration (sec.) | 5 | 4 |
C Converted Double Precision Whetstones (MIPS) | 2000.0 | 2500.0 |
Dhrystone:
AM4378 | AM6548 |
---|---|
Microseconds for one run through Dhrystone: 0.2<br>Dhrystones per Second: 4081632.8<br>CPU clock = 1000 MHz<br>Dhrystone DMIPS/MHz = 2.3 | Dhrystone Benchmark, Version 2.1+Thread (Language: C)<br>Stage 1: find good iteration count without threads<br>Attempting 100000 iterations<br>Attempting 200000 iterations<br>Attempting 400000 iterations<br>Attempting 800000 iterations<br>Attempting 1600000 iterations<br>Attempting 3200000 iterations<br>Attempting 6400000 iterations<br>dhrystones 2910932, dmips=1567<br>Stage 2: find best number of threads<br>6400000 iterations * 1 threads<br>dhrystones 2908775, dmips=1566<br>6400000 iterations * 2 threads<br>dhrystones 5818650, dmips=3133<br>6400000 iterations * 4 threads<br>dhrystones 11313222, dmips=6092<br>6400000 iterations * 8 threads<br>dhrystones 11401372, dmips=6139 |
NBench:
AM4378 | AM6548 |
---|---|
BYTEmark* Native Mode Benchmark ver. 2 (10/95)<br>Test : Iterations/sec. : Old Index Pentium 90* : | BYTEmark* Native Mode Benchmark ver. 2 (10/95)<br>Test : Iterations/sec. : Old Index Pentium 90* : |
Result:
Most applications use external memory heavily, so memory bandwidth and memory latency are very important. Note also that the TI AM6548 has a 64-bit memory bus width.
The AM6548 shows about 3.86 times higher memory performance than the AM4378 (direct test), which is consistent with the bw_mem 1M rdwr results (1119.82 MB/s vs 290.15 MB/s).
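To see how the memory gain varies across the individual bw_mem tests, the ratios can be computed directly from the measured values. This is a minimal sketch; the dictionary and function names are illustrative, and the numbers are copied from the bandwidth results above.

```python
# bw_mem 1M results in MB/s, copied from the measurements: (AM4378, AM6548).
BW_MB_S = {
    "rd": (367.92, 1401.79),
    "rdwr": (290.15, 1119.82),
    "cp": (199.68, 521.29),
    "frd": (239.58, 1372.21),
    "fcp": (193.16, 1051.71),
    "bzero": (672.16, 4128.17),
    "bcopy": (196.73, 1056.07),
}


def bandwidth_speedups(results):
    """Return the AM6548/AM4378 bandwidth ratio for each test."""
    return {test: am6548 / am4378 for test, (am4378, am6548) in results.items()}


bw_ratios = bandwidth_speedups(BW_MB_S)
for test, ratio in bw_ratios.items():
    print(f"bw_mem 1M {test:6s} {ratio:5.2f}x")
```

With these inputs the rdwr ratio comes out at about 3.86, the smallest gain is on cp (about 2.61) and the largest on bzero (about 6.14).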
The CPU frequencies of the two SoCs are almost the same (1 GHz vs 1.1 GHz); the differences are the 32-bit (AM4378) versus 64-bit (AM6548) architecture and the AM6548's 4 cores.
A pure core performance comparison (single core, ARM instruction set) shows only a modest gain, around +25%. The main benefit may come from the larger register size (64-bit ARM) and from NEON64.
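The +25% figure matches the Whetstone results (2500.0 vs 2000.0 MIPS). The sketch below reproduces that number and also normalizes it by clock frequency. It assumes the AM4378 ran at 1000 MHz (as its Dhrystone output reports) and the AM6548 at 1100 MHz (the other frequency quoted above); the names used are illustrative.

```python
# Whetstone results in MIPS, copied from the measurements.
WHETSTONE_MIPS = {"AM4378": 2000.0, "AM6548": 2500.0}

# Assumed clocks in MHz: 1000 from the AM4378 Dhrystone output,
# 1100 taken as the AM6548 clock from the "1 GHz vs 1.1 GHz" remark.
CLOCK_MHZ = {"AM4378": 1000.0, "AM6548": 1100.0}


def relative_gain(new, old):
    """Return the fractional gain of new over old (0.25 means +25%)."""
    return new / old - 1.0


raw_gain = relative_gain(WHETSTONE_MIPS["AM6548"], WHETSTONE_MIPS["AM4378"])

mips_per_mhz = {soc: WHETSTONE_MIPS[soc] / CLOCK_MHZ[soc] for soc in WHETSTONE_MIPS}
per_clock_gain = relative_gain(mips_per_mhz["AM6548"], mips_per_mhz["AM4378"])

print(f"raw Whetstone gain:     {raw_gain:+.1%}")
print(f"per-MHz Whetstone gain: {per_clock_gain:+.1%}")
```

Under these assumptions, part of the raw gain is explained by the 10% higher clock, so the per-MHz gain is smaller than +25%.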
Multicore operation will definitely give a larger performance gain, provided it is used correctly.
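The multithreaded Dhrystone run on the AM6548 gives a rough idea of how well this scales. The sketch below computes the speedup over one thread and the per-thread efficiency from the reported dmips values; the names are illustrative, and Dhrystone is a best case with no shared data, so real applications may scale worse.

```python
# AM6548 Dhrystone stage 2 results: thread count -> reported dmips.
DMIPS_BY_THREADS = {1: 1566, 2: 3133, 4: 6092, 8: 6139}


def thread_scaling(dmips_by_threads):
    """Return {threads: (speedup vs 1 thread, efficiency per thread)}."""
    base = dmips_by_threads[1]
    return {
        threads: (dmips / base, dmips / base / threads)
        for threads, dmips in dmips_by_threads.items()
    }


scaling = thread_scaling(DMIPS_BY_THREADS)
for threads, (speedup, efficiency) in scaling.items():
    print(f"{threads} threads: speedup {speedup:4.2f}x, efficiency {efficiency:4.0%}")
```

With these values the speedup is close to linear up to 4 threads (about 3.89x) and stays essentially flat at 8 threads, as expected on 4 cores.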