
Performance comparison TI AM4378 vs TI AM6548

March 2, 2019
We are receiving requests from customers who want to move an existing hardware solution to a new platform with higher performance and new features. They would like to know the performance difference between the Texas Instruments AM4378 and AM6548 SoCs.
This comparison covers only the performance of the ARM cores.
As a reference we use two kits:

AM65x evaluation module (EVM) (TMDX654GPEVM)

AM437x Starter Kit (TMDXSK437X)

Tests

Dynamic frequency scaling is off.

Memory bandwidth tests:
DDR Bandwidth

BANDWIDTH MEASUREMENTS (block size 1.00 MB)

Test              AM4378 (MB/s)   AM6548 (MB/s)
-----------------------------------------------
bw_mem 1M rd           367.92         1401.79
bw_mem 1M rdwr         290.15         1119.82
bw_mem 1M cp           199.68          521.29
bw_mem 1M frd          239.58         1372.21
bw_mem 1M fcp          193.16         1051.71
bw_mem 1M bzero        672.16         4128.17
bw_mem 1M bcopy        196.73         1056.07
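To make the bandwidth gap easier to read, the per-test speedup can be computed directly from the two listings. The following is a minimal sketch; the dictionary and function names are illustrative and not part of the benchmark tooling.

```python
# bw_mem results in MB/s, copied from the listings above (1 MB block).
am4378 = {"rd": 367.92, "rdwr": 290.15, "cp": 199.68, "frd": 239.58,
          "fcp": 193.16, "bzero": 672.16, "bcopy": 196.73}
am6548 = {"rd": 1401.79, "rdwr": 1119.82, "cp": 521.29, "frd": 1372.21,
          "fcp": 1051.71, "bzero": 4128.17, "bcopy": 1056.07}

def speedups(old, new):
    """Return the new/old bandwidth ratio per test, rounded to two decimals."""
    return {test: round(new[test] / old[test], 2) for test in old}

ratios = speedups(am4378, am6548)
for test, ratio in ratios.items():
    print(f"{test:6s} {ratio:5.2f}x")
```

The ratios range from about 2.6x (cp) to about 6.1x (bzero). The rdwr ratio comes out at 3.86x, which appears to be the figure quoted in the Result section.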

Memory latency tests:
Memory Latency (AM4378: stride=128, AM6548: stride=64)

Blocksize    AM4378 latency   AM6548 latency
  (MB)           (ns)             (ns)
--------------------------------------------
0.00049          4.019            3.760
0.00098          4.019            3.759
0.00195          4.019            3.761
0.00293          4.019            3.760
0.00391          4.019            3.760
0.00586          4.019            3.760
0.00781          4.024            3.760
0.01172          4.020            3.760
0.01562          4.031            3.762
0.02344         11.037            3.762
0.03125          9.284            3.781
0.04688         14.722            7.994
0.06250         15.354            8.571
0.09375         16.474            9.613
0.12500         16.369            9.928
0.18750         16.664           10.263
0.25000         48.936           10.346
0.37500         77.136           10.436
0.50000         85.192           10.666
0.75000         95.469           32.556
1.00000         98.413           44.603
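The latency curves show distinct steps as the block size grows. A simple way to locate them is to flag every block size at which latency at least doubles relative to the previous sample. The sketch below does that; the 2.0 threshold and all names are assumptions chosen for illustration, and the steps are only a plausible indication of where the working set stops fitting into one level of the memory hierarchy. Note that the two runs use different strides (128 vs 64).

```python
# Block sizes (MB) and latencies (ns), copied from the listings above.
sizes_mb = [0.00049, 0.00098, 0.00195, 0.00293, 0.00391, 0.00586, 0.00781,
            0.01172, 0.01562, 0.02344, 0.03125, 0.04688, 0.06250, 0.09375,
            0.12500, 0.18750, 0.25000, 0.37500, 0.50000, 0.75000, 1.00000]
am4378_ns = [4.019, 4.019, 4.019, 4.019, 4.019, 4.019, 4.024, 4.020, 4.031,
             11.037, 9.284, 14.722, 15.354, 16.474, 16.369, 16.664, 48.936,
             77.136, 85.192, 95.469, 98.413]
am6548_ns = [3.760, 3.759, 3.761, 3.760, 3.760, 3.760, 3.760, 3.760, 3.762,
             3.762, 3.781, 7.994, 8.571, 9.613, 9.928, 10.263, 10.346,
             10.436, 10.666, 32.556, 44.603]

def latency_jumps(sizes, latencies, factor=2.0):
    """Block sizes where latency is at least `factor` times the previous sample."""
    return [size
            for prev, cur, size in zip(latencies, latencies[1:], sizes[1:])
            if cur >= factor * prev]

am4378_jumps = latency_jumps(sizes_mb, am4378_ns)
am6548_jumps = latency_jumps(sizes_mb, am6548_ns)
print("AM4378 steps at (MB):", am4378_jumps)
print("AM6548 steps at (MB):", am6548_jumps)
```

With this threshold the AM4378 steps up at 0.02344 MB and 0.25 MB, while the AM6548 steps up later, at 0.04688 MB and 0.75 MB, and stays near 10 ns for block sizes up to 0.5 MB.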

Whetstone:
AM4378:
Execution time approx. 10 seconds
Loops: 100000, Iterations: 1, Duration: 5 sec.
C Converted Double Precision Whetstones: 2000.0 MIPS

AM6548:
Execution time approx. 10 seconds
Loops: 100000, Iterations: 1, Duration: 4 sec.
C Converted Double Precision Whetstones: 2500.0 MIPS
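The MIPS values can be reproduced from the loop count and the duration. The sketch below assumes the scoring used by the classic C conversion of Whetstone (100 x loops x iterations Whetstone instructions divided by the elapsed seconds); this assumption matches both logged results, but the function name is illustrative.

```python
def whetstone_mips(loops, iterations, duration_s):
    """MIPS under the assumed scoring: 100 * loops * iterations instructions
    (in thousands) per elapsed second gives KIPS; divide by 1000 for MIPS."""
    kips = 100.0 * loops * iterations / duration_s
    return kips / 1000.0

am4378_mips = whetstone_mips(100000, 1, 5)   # duration from the AM4378 log
am6548_mips = whetstone_mips(100000, 1, 4)   # duration from the AM6548 log
gain = am6548_mips / am4378_mips - 1.0

print(am4378_mips, am6548_mips, f"{gain:+.0%}")
```

Because the durations are logged as whole seconds (5 s vs 4 s), the resulting +25% is a coarse estimate. A run with more loops would likely give a tighter figure.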


Dhrystone:
AM4378:
Microseconds for one run through Dhrystone: 0.2
Dhrystones per Second: 4081632.8

CPU clock = 1000 MHz
Dhrystone DMIPS/MHz =   2.3

AM6548:
Dhrystone Benchmark, Version 2.1+Thread (Language: C)
Stage 1: find good iteration count without threads
Attempting 100000 iterations
Attempting 200000 iterations
Attempting 400000 iterations
Attempting 800000 iterations
Attempting 1600000 iterations
Attempting 3200000 iterations
Attempting 6400000 iterations
dhrystones 2910932, dmips=1567
Stage 2: find best number of threads
6400000 iterations * 1 threads
dhrystones 2908775, dmips=1566
6400000 iterations * 2 threads
dhrystones 5818650, dmips=3133
6400000 iterations * 4 threads
dhrystones 11313222, dmips=6092
6400000 iterations * 8 threads
dhrystones 11401372, dmips=6139



NBench:

AM4378 (first listing) and AM6548 (second listing):

BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Test            : Iterations/sec.   : Old Index Pentium 90* :
----------------:-------------------:-----------------------:
NUMERIC SORT    :           585.52  :            15.02      :
STRING SORT     :           63.456  :            28.35      :
BITFIELD        :        1.982e+08  :            34.00      :
FP EMULATION    :           87.378  :            41.93      :
FOURIER         :           7164.5  :             8.15      :
ASSIGMENT       :           8.3246  :            31.68      :
IDEA            :           1649.7  :            25.23      :
HUFFMAN         :            881.1  :            24.43      :
NEURAL NET      :           8.4186  :            13.52      :
LU DECOMPOSITION:           308.96  :            16.01      :
====== ORIGINAL BYTEMARK RESULTS ============================
INTEGER INDEX   :           27.486
FLOATING-POINT INDEX :      12.082
Baseline (MSDOS*):  Pentium* 90, 256 KB L2-cache
====== LINUX DATA BELOW =====
CPU                  :  ARMv7 Processor rev 10 (v7l)
L2 Cache             :
OS                   :  Linux 4.14.79-rt47-g28d73230da
C compiler           :  arm-linux-gnueabihf-gcc
libc                 :  static
MEMORY INDEX         :  6.350
INTEGER INDEX        :  7.267
FLOATING-POINT INDEX :  6.701
Baseline (LINUX)     :  AMD K6/233*, 512 KB L2-cache, gcc 2.7.2

AM6548:
BYTEmark* Native Mode Benchmark ver. 2 (10/95)

Test            : Iterations/sec.   : Old Index Pentium 90* :
----------------:-------------------:-----------------------:
NUMERIC SORT    :           450.25  :            11.55      :
STRING SORT     :            94.81  :            42.36      :
BITFIELD        :       1.2654e+08  :            21.71      :
FP EMULATION    :           62.509  :            29.99      :
FOURIER         :             6823  :             7.76      :
ASSIGMENT       :           8.4231  :            32.05      :
IDEA            :           1958.2  :            29.95      :
HUFFMAN         :           638.22  :            17.70      :
NEURAL NET      :           4.3151  :             6.93      :
LU DECOMPOSITION:           320.97  :            16.63      :
====== ORIGINAL BYTEMARK RESULTS ============================
INTEGER INDEX   :           24.573
FLOATING-POINT INDEX :      9.635
Baseline (MSDOS*):  Pentium* 90, 256 KB L2-cache
====== LINUX DATA BELOW =====
CPU                  :  4 CPU
L2 Cache             :
OS                   :  Linux 4.14.67-gd315a9bb00
C compiler           :  aarch64-linux-gnu-gcc
libc                 :  static
MEMORY INDEX         :  6.276
INTEGER INDEX        :  6.026
FLOATING-POINT INDEX :  5.344
Baseline (LINUX)     :  AMD K6/233*, 512 KB L2-cache, gcc 2.7.2
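The summary indices in the "ORIGINAL BYTEMARK RESULTS" section can be cross-checked against the per-test "Old Index" column. In nbench, the integer and floating-point indices are geometric means over two groups of tests. The sketch below assumes the usual grouping (seven integer tests, three floating-point tests) and reproduces the logged indices to within rounding. The test names are spelled exactly as in the listings.

```python
import math

def geometric_mean(values):
    """Geometric mean computed via logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Assumed nbench grouping of tests into the two summary indices.
INTEGER_TESTS = ["NUMERIC SORT", "STRING SORT", "BITFIELD", "FP EMULATION",
                 "ASSIGMENT", "IDEA", "HUFFMAN"]
FP_TESTS = ["FOURIER", "NEURAL NET", "LU DECOMPOSITION"]

# "Old Index Pentium 90" column, copied from the listings above.
am4378 = {"NUMERIC SORT": 15.02, "STRING SORT": 28.35, "BITFIELD": 34.00,
          "FP EMULATION": 41.93, "FOURIER": 8.15, "ASSIGMENT": 31.68,
          "IDEA": 25.23, "HUFFMAN": 24.43, "NEURAL NET": 13.52,
          "LU DECOMPOSITION": 16.01}
am6548 = {"NUMERIC SORT": 11.55, "STRING SORT": 42.36, "BITFIELD": 21.71,
          "FP EMULATION": 29.99, "FOURIER": 7.76, "ASSIGMENT": 32.05,
          "IDEA": 29.95, "HUFFMAN": 17.70, "NEURAL NET": 6.93,
          "LU DECOMPOSITION": 16.63}

def indices(old_index):
    """Return (integer index, floating-point index) for one listing."""
    return (geometric_mean([old_index[t] for t in INTEGER_TESTS]),
            geometric_mean([old_index[t] for t in FP_TESTS]))

am4378_int, am4378_fp = indices(am4378)
am6548_int, am6548_fp = indices(am6548)
print(f"AM4378: integer {am4378_int:.2f}, floating-point {am4378_fp:.2f}")
print(f"AM6548: integer {am6548_int:.2f}, floating-point {am6548_fp:.2f}")
```

Because a geometric mean is used, a single weak test (for example NEURAL NET on the AM6548) lowers the index without dominating it.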


Result:

Most applications use external memory heavily, so memory bandwidth and memory latency are very important. Note that the TI AM6548 has a 64-bit memory bus.

The AM6548's memory performance is 3.86 times higher than the AM4378's (direct test).
The CPU frequencies of both SoCs are almost the same (1 GHz vs 1.1 GHz); the differences are the 32-bit architecture of the AM4378 versus the 64-bit architecture of the AM6548, plus the AM6548's 4 cores.
A pure core-to-core comparison (single core, ARM instruction set) shows only a modest gain, around +25%. The main benefit may come from the larger registers (64-bit ARM) and from NEON64.
Multicore operation will definitely deliver more performance if it is used correctly.
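The multicore point can be quantified from the threaded Dhrystone run on the AM6548. The sketch below computes the speedup over one thread and the efficiency per usable core. The core count of 4 is taken from the "4 CPU" line in the NBench listing; the function and variable names are illustrative.

```python
# dmips per thread count, copied from the AM6548 Dhrystone log.
dmips_by_threads = {1: 1566, 2: 3133, 4: 6092, 8: 6139}
CORES = 4  # the AM6548 NBench listing reports "4 CPU"

def scaling(scores, cores):
    """Return {threads: (speedup vs 1 thread, efficiency per usable core)}."""
    base = scores[1]
    result = {}
    for threads, score in scores.items():
        speedup = score / base
        efficiency = speedup / min(threads, cores)
        result[threads] = (round(speedup, 2), round(efficiency, 2))
    return result

result = scaling(dmips_by_threads, CORES)
for threads, (speedup, efficiency) in result.items():
    print(f"{threads} threads: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

For this compute-bound benchmark, scaling is close to linear up to 4 threads (about 3.89x) and flattens at 8 threads, as expected once every core is busy. Memory-bound workloads may scale less well.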