Accessing data through a layer of indirection has some performance impact. What matters most, however, is whether the resulting memory accesses are predictable. If the ultimately accessed data is read in a coherent order, even through a layer of indirection, performance is still pretty good.
Here are some results from the code below, which performs 2^26 memory reads in various ways:
Data size: 256.000 MB
no indirection elapsed: 0.02169 s (12.38 GiB/s, 1.196 cycles/read)
sum: 4261412864
predictable indices elapsed: 0.03996 s (6.718 GiB/s, 2.203 cycles/read)
sum: 4261412864
unpredictable indices elapsed: 0.9277 s (0.2894 GiB/s, 51.15 cycles/read)
sum: 4261412864

As you can see, adding a layer of indirection has some performance impact (~2.2 cycles per read instead of ~1.2 cycles). However, as long as the data is accessed in a predictable, coherent fashion, the cache and prefetcher on the CPU mean the slowdown is not too bad.
If, however, the order of access is not coherent and predictable, performance takes a nose-dive: the CPU cache and prefetcher cannot fetch memory before it is needed, nor fetch it efficiently when it is needed, and each read incurs a large latency cost.
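To put the last number in perspective: 0.9277 s for 2^26 reads at 3.7 GHz works out to 0.9277 × 3.7×10^9 / 2^26 ≈ 51 cycles per read, or roughly 14 ns per read, versus roughly 0.3 ns per read in the no-indirection case. In other words, with a random access pattern almost every read ends up waiting on a cache miss, although the CPU keeping several misses in flight at once presumably stops it from being even worse.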
This kind of performance degradation is what I expect from unpredictable and incoherent memory access, with or without a layer of indirection.
The key point is that it's not indirection per se that is slow, but the unpredictable and incoherent memory access patterns that commonly accompany indirection.
const int N = 1 << 26; // num reads
const double cycles_per_sec = 3.7e9; // For my i7 Ivy bridge, running in 'turbo-boost' mode.

// Allocate some data, fill with semi-random values.
std::vector<uint32> data(N);
for(int i=0; i<N; ++i)
    data[i] = i % 128;

conPrint("Data size: " + getNiceByteSize(N * sizeof(uint32)));

// Test data access without indirection
{
    Timer timer;
    uint64 sum = 0;
    for(int i=0; i<N; ++i)
    {
        sum += data[i];
    }

    double elapsed = timer.elapsed();
    const double bytes_per_sec = 4 * N / elapsed;
    const double cycles_per_read = (elapsed * cycles_per_sec) / N; // num cycles = elapsed * cycles-per-sec
    conPrint("no indirection elapsed: " + doubleToStringNSigFigs(elapsed, 4) + " s (" +
        doubleToStringNSigFigs(bytes_per_sec * 1.0e-9, 4) + " GiB/s, " +
        doubleToStringNSigFigs(cycles_per_read, 4) + " cycles/read)");
    printVar(sum);
}

// Test indirection with predictable indices
{
    std::vector<uint32> indices(N);
    for(int i=0; i<N; ++i)
        indices[i] = i;

    Timer timer;
    uint64 sum = 0;
    for(int i=0; i<N; ++i)
    {
        sum += data[indices[i]];
    }

    double elapsed = timer.elapsed();
    const double bytes_per_sec = 4 * N / elapsed;
    const double cycles_per_read = (elapsed * cycles_per_sec) / N;
    conPrint("predictable indices elapsed: " + doubleToStringNSigFigs(elapsed, 4) + " s (" +
        doubleToStringNSigFigs(bytes_per_sec * 1.0e-9, 4) + " GiB/s, " +
        doubleToStringNSigFigs(cycles_per_read, 4) + " cycles/read)");
    printVar(sum);
}

// Test indirection with unpredictable indices
{
    std::vector<uint32> indices(N);
    for(int i=0; i<N; ++i)
        indices[i] = i;

    // Permute indices
    for(int t=N-1; t>=0; --t)
    {
        int k = (int)(rng.unitRandom() * t);
        mySwap(indices[t], indices[k]);
    }

    Timer timer;
    uint64 sum = 0;
    for(int i=0; i<N; ++i)
    {
        sum += data[indices[i]];
    }

    double elapsed = timer.elapsed();
    const double bytes_per_sec = 4 * N / elapsed;
    const double cycles_per_read = (elapsed * cycles_per_sec) / N;
    conPrint("unpredictable indices elapsed: " + doubleToStringNSigFigs(elapsed, 4) + " s (" +
        doubleToStringNSigFigs(bytes_per_sec * 1.0e-9, 4) + " GiB/s, " +
        doubleToStringNSigFigs(cycles_per_read, 4) + " cycles/read)");
    printVar(sum);
}
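The code above relies on some non-standard helper functions (conPrint, Timer, getNiceByteSize, doubleToStringNSigFigs, printVar, mySwap, rng). If you want to try this yourself, here is a rough, self-contained sketch of the same benchmark using only the C++ standard library; it is not the exact code used for the numbers above, and the cycles_per_sec constant needs adjusting for your own CPU. Compile with optimizations enabled (e.g. -O2) and C++14 or later.

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

int main()
{
    const int N = 1 << 26; // num reads
    const double cycles_per_sec = 3.7e9; // Adjust for your CPU's clock speed.

    // Fill the data array with semi-random values.
    std::vector<uint32_t> data(N);
    for(int i=0; i<N; ++i)
        data[i] = i % 128;

    // Index array: initially the identity mapping (predictable indices).
    std::vector<uint32_t> indices(N);
    std::iota(indices.begin(), indices.end(), 0u);

    // Times N reads produced by the 'read' callable and reports throughput.
    auto time_reads = [&](const char* name, auto read)
    {
        const auto start = std::chrono::steady_clock::now();
        uint64_t sum = 0;
        for(int i=0; i<N; ++i)
            sum += read(i);
        const double elapsed = std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();

        const double bytes_per_sec = 4.0 * N / elapsed;
        const double cycles_per_read = (elapsed * cycles_per_sec) / N;
        std::printf("%-22s elapsed: %.4g s (%.4g GB/s, %.4g cycles/read), sum: %llu\n",
            name, elapsed, bytes_per_sec * 1.0e-9, cycles_per_read, (unsigned long long)sum);
    };

    time_reads("no indirection",        [&](int i){ return data[i]; });
    time_reads("predictable indices",   [&](int i){ return data[indices[i]]; });

    // Randomly permute the index array to make the access pattern unpredictable.
    std::mt19937 rng(1);
    std::shuffle(indices.begin(), indices.end(), rng);
    time_reads("unpredictable indices", [&](int i){ return data[indices[i]]; });

    return 0;
}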