Memory Access Scheduling — 2000 | Scott Rixner, William J. Dally, Ujval J. Kapasi, Peter Mattson, and John D. Owens
Memory access scheduling is a technique that improves the performance of a memory system by reordering memory references to exploit locality within the 3-D memory structure of DRAM. This paper introduces memory access scheduling, which can improve memory bandwidth by up to 144% over a system with no access scheduling. For media processing applications, memory access scheduling improves sustained memory bandwidth by 30%, and the traces of these applications offer a potential bandwidth improvement of up to 93%. The paper describes the characteristics of modern DRAM architecture, introduces memory access scheduling and candidate algorithms for reordering DRAM operations, describes the streaming media processor and benchmarks used in the evaluation, compares the performance of the various scheduling algorithms, and surveys work related to memory access scheduling.
Modern DRAMs are three-dimensional memory devices with dimensions of bank, row, and column. Sequential accesses to different rows within one bank have high latency and cannot be pipelined, while accesses to different banks or different words within a single row have low latency and can be pipelined. The three-dimensional nature of modern memory devices makes it advantageous to reorder memory operations to exploit the non-uniform access times of the DRAM. This optimization is similar to how a superscalar processor schedules arithmetic operations out of order.
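The non-uniform access times described above can be illustrated with a toy model; note that the cycle counts below are illustrative assumptions, not figures from the paper:

```python
# Toy model of non-uniform DRAM access latency. A bank holds one "open" row
# in its sense amplifiers; accessing it again is cheap (a row hit), while
# accessing a different row requires a precharge and activation (a row miss).
ROW_HIT_CYCLES = 1     # column access to the already-open row
ROW_MISS_CYCLES = 6    # precharge + row activation + column access

class Bank:
    def __init__(self):
        self.open_row = None  # no row active yet

    def access(self, row):
        """Return the latency of accessing `row`, updating the open row."""
        if self.open_row == row:
            return ROW_HIT_CYCLES
        self.open_row = row
        return ROW_MISS_CYCLES

bank = Bank()
# Repeated accesses to one row are fast; alternating rows pays the miss
# penalty every time.
latencies = [bank.access(r) for r in [0, 0, 0, 1, 0, 1]]
print(latencies)  # → [6, 1, 1, 6, 6, 6]
```

In this model, reordering the trailing `[1, 0, 1]` accesses into `[0, 1, 1]` would turn two row misses into hits, which is exactly the opportunity memory access scheduling exploits.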
The paper introduces memory access scheduling, which is the process of ordering DRAM operations (bank precharge, row activation, and column access) necessary to complete the set of currently pending memory references. The memory access scheduler must generate a schedule that conforms to the timing and resource constraints of these modern DRAMs.
The paper presents experimental results showing that memory access scheduling significantly improves memory bandwidth. For example, the unit load benchmark achieves 97% of the peak bandwidth of the DRAMs with no access scheduling. The 14% drop in sustained bandwidth from the unit load benchmark to the unit benchmark shows the performance degradation imposed by forcing intermixed load and store references to complete in order. The unit conflict benchmark further shows the penalty of swapping back and forth between rows in the DRAM banks, which drops the sustained bandwidth down to 51% of the peak. The random benchmarks sustain about 15% of the bandwidth of the unit load benchmark.
The paper also discusses the effects of varying the bank buffer size on sustained memory bandwidth when using memory access scheduling. The row/closed scheduling algorithm is used with bank buffers varying in size from 4 to 64 entries. The unit load benchmark requires only 8 entries to saturate the memory system. The unit conflict and random benchmarks require 16 entries to achieve their peak bandwidth.