Modern CNN architectures perform a large number of computations and memory accesses, resulting in high energy consumption. A significant part of this energy consumption is due to off-chip memory accesses performed by the convolution layers on CNN accelerators. These accelerators have limited on-chip memory and therefore apply loop tiling, reading only a fraction of the data at a time. These data tiles are fetched repeatedly from off-chip memory, resulting in a large volume of off-chip memory traffic. This traffic can be reduced by carefully selecting the tile dimensions and adaptively selecting the data reuse scheme of the tiles. The off-chip memory accesses incurred by loop tiling depend not only on the tile dimensions but also on the address alignment of the tiles and the data bus width. In this work we propose a scheme for determining the optimal tile dimensions and data flow scheme while taking the off-chip memory bus width and tile addresses into account. Experimental results show that for VGG16 we can reduce memory accesses by 40% on average on a 256-bit wide memory bus compared to state-of-the-art adaptive tile-based approaches.
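To illustrate why tile address alignment matters, the following minimal Python sketch counts the bus transactions needed to fetch a row-major tile from off-chip memory, under the assumption that each burst must start at a bus-width-aligned address. The function names, tile sizes, and feature-map pitch are hypothetical and chosen only for illustration; they do not correspond to the paper's actual cost model.

```python
import math

def tile_row_transactions(row_start_byte, row_bytes, bus_bytes):
    """Bus transactions needed to fetch one contiguous tile row.

    Assumes each transaction starts at a bus-width-aligned address,
    so an unaligned row start wastes part of the first (and possibly
    the last) transaction.
    """
    first_beat = (row_start_byte // bus_bytes) * bus_bytes
    last_byte = row_start_byte + row_bytes - 1
    last_beat = (last_byte // bus_bytes) * bus_bytes
    return (last_beat - first_beat) // bus_bytes + 1

def tile_transactions(base_byte, tile_h, tile_w, row_pitch_bytes,
                      elem_bytes, bus_bytes):
    """Total transactions to fetch a tile_h x tile_w tile whose rows
    lie row_pitch_bytes apart in DRAM (row-major feature map)."""
    row_bytes = tile_w * elem_bytes
    return sum(
        tile_row_transactions(base_byte + r * row_pitch_bytes,
                              row_bytes, bus_bytes)
        for r in range(tile_h)
    )

# Hypothetical example: a 16x16 tile of 2-byte activations in a
# 224-wide feature map, fetched over a 256-bit (32-byte) bus.
bus = 32
pitch = 224 * 2
aligned = tile_transactions(0, 16, 16, pitch, 2, bus)     # 16 transactions
unaligned = tile_transactions(10, 16, 16, pitch, 2, bus)  # 32 transactions
print(aligned, unaligned)
```

In this toy setting, an aligned tile whose rows each fill exactly one 32-byte bus word needs one transaction per row, while shifting the tile base by 10 bytes makes every row straddle two bus words and doubles the traffic, even though the tile dimensions are unchanged. This is the effect the proposed scheme accounts for when choosing tile dimensions and data flow.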