162. stream¶
stream是内存性能评估的工业标准之一,工具现由弗吉尼亚计算机系维护。
官方指导: 教程
162.1. 下载源码¶
这里以C源码为例。
wget https://www.cs.virginia.edu/stream/FTP/Code/stream.c
完整的项目代码,请访问 链接 ## 编译
gcc -O2 -mcmodel=large -fopenmp -DSTREAM_ARRAY_SIZE=10000000 -DNTIMES=30 -DOFFSET=4096 stream.c -o stream
-mcmodel=large 大内存服务器使用参数
-DSTREAM_ARRAY_SIZE=10000000 根据L3 cache的大小选择数组元素,使数组的占用的内存大小超过L3 cache的大小
-DNTIMES=30 执行测试的次数,选择最好的依次打印
-DOFFSET=4096 有可能改变数组再内存中的对齐方式
162.2. 执行¶
./stream
162.3. 1616服务器¶
me@ubuntu:~/code/stream$ ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 100000000 (elements), Offset = 4096 (elements)
Memory per array = 762.9 MiB (= 0.7 GiB).
Total memory required = 2288.8 MiB (= 2.2 GiB).
Each kernel will be executed 30 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 64
Number of Threads counted = 64
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 37823 microseconds.
(= 37823 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 58415.5 0.033761 0.027390 0.073813
Scale: 58925.3 0.031476 0.027153 0.074888
Add: 56900.2 0.047931 0.042179 0.076715
Triad: 57035.6 0.049256 0.042079 0.089866
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
162.4. 1620服务器¶
[me@centos stream]$ ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 4096 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 30 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Number of Threads requested = 128
Number of Threads counted = 128
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 3460 microseconds.
(= 3460 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 103292.1 0.002324 0.001549 0.004953
Scale: 89145.7 0.002493 0.001795 0.004599
Add: 101608.3 0.003173 0.002362 0.004439
Triad: 105318.4 0.003154 0.002279 0.005893
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
162.5. ARM树莓派执行结果¶
树莓派总内存大小为1GB,内存频率没有标明
pi@raspberrypi:~/app/stream $ ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 114310 microseconds.
(= 114310 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 2030.0 0.079971 0.078817 0.083276
Scale: 2030.5 0.080576 0.078797 0.084133
Add: 1912.1 0.126776 0.125519 0.129104
Triad: 1652.5 0.145481 0.145232 0.145794
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
162.6. x86 PC执行结果¶
root@SZX:~/working/stream# ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 14092 microseconds.
(= 14092 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 7528.7 0.024472 0.021252 0.027480
Scale: 7773.3 0.024656 0.020583 0.028275
Add: 7866.3 0.034299 0.030510 0.036829
Triad: 8017.6 0.035185 0.029934 0.038185
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
root@SZX:~/working/stream#
162.7. x86 服务器执行结果¶
me@Board:~/stream$ ./stream
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 10000000 (elements), Offset = 0 (elements)
Memory per array = 76.3 MiB (= 0.1 GiB).
Total memory required = 228.9 MiB (= 0.2 GiB).
Each kernel will be executed 10 times.
The *best* time for each kernel (excluding the first iteration)
will be used to compute the reported bandwidth.
-------------------------------------------------------------
Your clock granularity/precision appears to be 1 microseconds.
Each test below will take on the order of 26998 microseconds.
(= 26998 clock ticks)
Increase the size of the arrays if this shows that
you are not getting at least 20 clock ticks per test.
-------------------------------------------------------------
WARNING -- The above is only a rough guideline.
For best results, please be sure you know the
precision of your system timer.
-------------------------------------------------------------
Function Best Rate MB/s Avg time Min time Max time
Copy: 8830.0 0.018140 0.018120 0.018157
Scale: 8800.5 0.018211 0.018181 0.018317
Add: 9812.8 0.024520 0.024458 0.024679
Triad: 9722.5 0.024715 0.024685 0.024746
-------------------------------------------------------------
Solution Validates: avg error less than 1.000000e-13 on all three arrays
-------------------------------------------------------------
me@Board:~/stream$ lscpu
162.8. 结果分析¶
1616内存硬件信息:
Array Handle: 0x0007
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMM120 J17
Bank Locator: SOCKET 1 CHANNEL 2 DIMM 0
Type: DDR4
Type Detail: Synchronous Registered (Buffered)
Speed: 2400 MT/s
Manufacturer: Samsung
Serial Number: 0x35125924
Asset Tag: 1709
Part Number: M393A4K40BB1-CRC
Rank: 2
Configured Clock Speed: 2400 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 1.2 V
Configured Voltage: 1.2 V
数量:4
1620内存硬件信息:
Array Handle: 0x0006
Error Information Handle: Not Provided
Total Width: 72 bits
Data Width: 64 bits
Size: 32 GB
Form Factor: DIMM
Set: None
Locator: DIMM170 J31
Bank Locator: SOCKET 1 CHANNEL 7 DIMM 0
Type: DDR4
Type Detail: Synchronous Registered (Buffered)
Speed: 2666 MT/s
Manufacturer: Samsung
Serial Number: 0x40C3BA1D
Asset Tag: 1838
Part Number: M393A4K40BB2-CTD
Rank: 2
Configured Clock Speed: 2666 MT/s
Minimum Voltage: 1.2 V
Maximum Voltage: 2.0 V
Configured Voltage: 1.2 V
数量:16
计算公式:
speed * data size /8 * DIMM number / 1024 /1024 = bandwidth
服务器 | 理论带宽 | stream测试值 |
---|---|---|
1616 | 2400*64/8*4/1024/1024=75GiB/s | 55GiB/s |
1620 | 2666*64/8*16/1024/1024=333GiB/s | 102GiB/s |
162.9. 问题记录:静态数组内存大小限制¶
当设置的数组大小比较大时,编译器会给出报警。
[root@localhost stream]# gcc -DSTREAM_ARRAY_SIZE=100000000 stream.c -o option_no_100M_stream
/tmp/ccTzV1dQ.o: In function `main':
stream.c:(.text+0x546): relocation truncated to fit: R_X86_64_32S against `.bss'
stream.c:(.text+0x57a): relocation truncated to fit: R_X86_64_32S against `.bss'
stream.c:(.text+0x5f9): relocation truncated to fit: R_X86_64_32S against `.bss'
stream.c:(.text+0x62e): relocation truncated to fit: R_X86_64_32S against `.bss'
stream.c:(.text+0x65e): relocation truncated to fit: R_X86_64_32S against `.bss'
stream.c:(.text+0x6a0): relocation truncated to fit: R_X86_64_32S against `.bss'
stream.c:(.text+0x6b9): relocation truncated to fit: R_X86_64_32S against `.bss'
stream.c:(.text+0x6c5): relocation truncated to fit: R_X86_64_32S against `.bss'
stream.c:(.text+0x6dd): relocation truncated to fit: R_X86_64_32S against `.bss'
collect2: error: ld returned 1 exit status
[root@localhost stream]#
解决办法是添加编译选项
-mcmodel=medium