The purpose of the tuning system is to generate the configuration file rapptune.h that is needed when building the library. The tuning system does this by running a suite of benchmark tests, and analyzing the measured performance for each candidate implementation, as shown in figure 2.
The tuning system consists of all files in the compute/tune and compute/tune/benchmark directories. It is layered on top of the standard build system and, when needed, is executed as part of e.g. make all. By separating the tuning system from the build system, the latter can be kept simple, and we can reuse it for tuning purposes.
If the library is not already tuned, we add the tune directory to the SUBDIRS variable of compute/Makefile.am, to connect the tuning system to the ordinary build system. This effectively makes the build system re-entrant. That might seem like a contradiction to what was stated earlier about separation, but it is really only a matter of letting one system dispatch the other one. Their inner workings are still kept hidden from each other.
The tuning process consists of the following steps:
1. The candidate libraries are built. Each candidate is configured with the internal configure option --with-internal-tune-generation=CAND, where CAND has the form <impl>,<unroll>, specifying the implementation and unroll factor for the candidate. Besides causing various RAPP_FORCE flags to be set for the build, this internal option shortcuts those parts of RAPP that don't apply when tuning, for example stopping compute/tune from being used, and stopping re-generation of e.g. documentation. The parts that need to be aware of this re-entrancy are confined to the top-level configure.ac script and the compute/tune/Makefile.am file.
2. The candidate builds can run in parallel if the -j option is used.
3. The candidates are stored in an archive as separate libraries, named rappcompute_tune_<impl>_<unroll> (see the naming sketch after this list).
4. The benchmark application is built from compute/tune/benchmark.
5. An executable archive, rappmeasure.run, is created, containing the library candidates, the benchmark application, the script compute/tune/measure.sh and the progress bar script compute/tune/progress.sh.
6. When cross-compiling, the user must manually run rappmeasure.run on the target platform. Otherwise it will be executed automatically. When finished, it has produced a data file tunedata.py.
7. The analysis is performed by running compute/tune/analyze.py on the data file. It creates the configuration header rapptune.h and a report tunereport.html.
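As a small illustration of the naming scheme in step 3, the candidate set could be enumerated like this in Python; the implementation names and unroll factors below are assumptions made for the example, not the actual candidate list used by the tuning system:

    # Enumerate hypothetical candidate library names of the form
    # rappcompute_tune_<impl>_<unroll>.
    IMPLEMENTATIONS = ["generic", "swar", "simd"]   # assumed implementation names
    UNROLL_FACTORS = [1, 2, 4]                      # assumed unroll factors

    def candidate_names():
        for impl in IMPLEMENTATIONS:
            for unroll in UNROLL_FACTORS:
                yield f"rappcompute_tune_{impl}_{unroll}"

    print(list(candidate_names()))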
After tuning, all the generated files are located in the compute/tune directory of the build tree. To make the tuned configuration for this platform available to everyone else, the files must be copied to the source directory and/or added to the distribution. A tarball containing the necessary files, suitable for sending to the maintainers, can be created using the make-target export-new-archfiles. There is also a make-target update-tune-cache for copying the generated tune file and HTML report to the right place and name in the local source directory. Alternatively, to also include the benchmark HTML produced by the benchmark tests, use the make-target update-archfiles.
The benchmark application takes the Compute layer library as an argument and loads it dynamically. It then runs its benchmark tests for the functions found in the library, measuring the throughput in pixels/second. If a function is not found, its throughput is reported as zero.
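The benchmark application itself is a compiled program; purely to illustrate the mechanism, a minimal Python sketch of the same idea could look as follows. The function signature and buffer layout are assumptions made for the example, not the actual benchmark interface:

    import ctypes
    import time

    def measure_throughput(lib_path, func_name, width=512, height=512, iterations=100):
        """Return the throughput in pixels/second, or 0.0 if the function is missing."""
        lib = ctypes.CDLL(lib_path)            # load the candidate library dynamically
        func = getattr(lib, func_name, None)   # look up the function by name
        if func is None:
            return 0.0                         # missing function: zero throughput
        buf = ctypes.create_string_buffer(width * height)
        start = time.perf_counter()
        for _ in range(iterations):
            func(buf, width, height)           # hypothetical signature, illustration only
        elapsed = time.perf_counter() - start
        return iterations * width * height / elapsed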
The script measure.sh runs the benchmark application with different library implementations and different image sizes. It generates a data file in Python format containing all measurement data.
When the measurement data file has been generated, the Python script analyze.py is used to analyze the data, determine the optimal implementations and parameters, and generate the configuration header rapptune.h. To compare the performance of two implementations, we need some sort of metric.
For a particular function, we can have several possible implementations. Order them from 1 to N, where N denotes the total number of implementations. For each implementation we also have several benchmark tests, corresponding to different image sizes. Let M denote the number of tests. For our function, we get an N x M matrix of measurements in pixels/second:
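Writing p_{ij} for the throughput of implementation i on test case j (the notation is assumed here, not taken from analyze.py), the measurements can be arranged as

    P = \begin{pmatrix}
          p_{11} & p_{12} & \cdots & p_{1M} \\
          p_{21} & p_{22} & \cdots & p_{2M} \\
          \vdots & \vdots & \ddots & \vdots \\
          p_{N1} & p_{N2} & \cdots & p_{NM}
        \end{pmatrix}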
We want to compute a ranking number for each implementation of the function. First we compute the average throughput across all implementations:
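In the assumed notation, this average for test case j is

    \bar{p}_j = \frac{1}{N} \sum_{i=1}^{N} p_{ij}, \qquad j = 1, \ldots, M.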
Next, we normalize the data with this average value, creating a data set of dimensionless values,
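which, in the assumed notation, can be written as

    s_{ij} = \frac{p_{ij}}{\bar{p}_j}, \qquad i = 1, \ldots, N, \quad j = 1, \ldots, M.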
These normalized numbers describe the speedup for a given implementation and test case, compared to the average performance of this test case. The normalized numbers are independent of the absolute throughput of each test case. This is what we want, since a fast test case could otherwise easily dwarf the results of the other tests. We want all test cases to contribute equally.
Finally, we compute the dimensionless ranking result as the arithmetic mean of the speedup results across all the test cases,
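which, in the assumed notation, gives

    r_i = \frac{1}{M} \sum_{j=1}^{M} s_{ij}, \qquad i = 1, \ldots, N.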
The implementation with the highest ranking is picked, and its parameters are written to the configuration header rapptune.h.
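A compact sketch of this ranking procedure, assuming the measurements for one function are available as a simple dictionary (the real analyze.py reads tunedata.py and is not reproduced here):

    def rank_candidates(measurements):
        """measurements maps candidate name -> list of throughputs in
        pixels/second, one entry per test case (image size).
        Returns the best candidate name and all ranking values."""
        candidates = list(measurements)
        num_tests = len(next(iter(measurements.values())))

        ranking = {}
        for cand in candidates:
            speedups = []
            for j in range(num_tests):
                # Average throughput of all candidates on test case j.
                avg = sum(measurements[c][j] for c in candidates) / len(candidates)
                # Dimensionless speedup of this candidate on test case j.
                speedups.append(measurements[cand][j] / avg)
            # Ranking: arithmetic mean of the speedups across all test cases.
            ranking[cand] = sum(speedups) / num_tests

        best = max(ranking, key=ranking.get)
        return best, ranking

    # Hypothetical example: three candidates, two image sizes.
    best, ranks = rank_candidates({
        "generic_1": [1.0e8, 2.0e8],
        "swar_2":    [1.5e8, 3.0e8],
        "simd_4":    [4.0e8, 6.0e8],
    })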
The analyze.py script also produces a bar plot of the tuning result in HTML format. It shows the relative speedup for the fastest one (any unroll factor) of the generic, SWAR and SIMD implementations. The gain factor reported is the ranking result, normalized with respect to the slowest bar plotted. Only functions with at least two different implementations are included in the plot.
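In terms of the ranking values r_i sketched above, the plotted gain factor for implementation i would be

    g_i = \frac{r_i}{\min_k r_k},

where the minimum runs over the implementations included in the plot (an assumed formalization of the normalization described above, not a formula taken from analyze.py).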