Modern CNN architectures perform a large number of computations and memory accesses, resulting in high energy consumption. A significant part of this energy consumption is due to off-chip memory accesses performed by the convolution layers on CNN accelerators. These accelerators have limited on-chip memory and therefore apply loop tiling, reading only a fraction of the data at a time. These data tiles are fetched repeatedly from off-chip memory, resulting in a large volume of off-chip memory traffic. This traffic can be reduced by carefully selecting the tile dimensions and adaptively selecting the data reuse scheme of the tiles. The off-chip memory accesses incurred by loop tiling depend not only on the tile dimensions but also on the address alignment of the tiles and the data bus width. In this work we propose a scheme for determining the optimal tile dimensions and data flow scheme while taking the off-chip memory bus width and tile addresses into account. Experimental results show that for VGG16 we can reduce memory accesses by 40% on average on a 256-bit wide memory bus compared to state-of-the-art adaptive tile-based approaches.
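To illustrate why tile address alignment matters, the following minimal Python sketch counts the bus transactions needed to fetch a row-major tile from off-chip memory, under the assumption that each burst must start at a bus-width-aligned address. The function names, tile sizes, and feature-map pitch are hypothetical and chosen only for illustration; they do not correspond to the paper's actual cost model.

```python
import math

def tile_row_transactions(row_start_byte, row_bytes, bus_bytes):
    """Bus transactions needed to fetch one contiguous tile row.

    Assumes each transaction starts at a bus-width-aligned address,
    so an unaligned row start wastes part of the first (and possibly
    the last) transaction.
    """
    first_beat = (row_start_byte // bus_bytes) * bus_bytes
    last_byte = row_start_byte + row_bytes - 1
    last_beat = (last_byte // bus_bytes) * bus_bytes
    return (last_beat - first_beat) // bus_bytes + 1

def tile_transactions(base_byte, tile_h, tile_w, row_pitch_bytes,
                      elem_bytes, bus_bytes):
    """Total transactions to fetch a tile_h x tile_w tile whose rows
    lie row_pitch_bytes apart in DRAM (row-major feature map)."""
    row_bytes = tile_w * elem_bytes
    return sum(
        tile_row_transactions(base_byte + r * row_pitch_bytes,
                              row_bytes, bus_bytes)
        for r in range(tile_h)
    )

# Hypothetical example: a 16x16 tile of 2-byte activations in a
# 224-wide feature map, fetched over a 256-bit (32-byte) bus.
bus = 32
pitch = 224 * 2
aligned = tile_transactions(0, 16, 16, pitch, 2, bus)     # 16 transactions
unaligned = tile_transactions(10, 16, 16, pitch, 2, bus)  # 32 transactions
print(aligned, unaligned)
```

In this toy setting, an aligned tile whose rows each fill exactly one 32-byte bus word needs one transaction per row, while shifting the tile base by 10 bytes makes every row straddle two bus words and doubles the traffic, even though the tile dimensions are unchanged. This is the effect the proposed scheme accounts for when choosing tile dimensions and data flow.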