//======================================================================================================================================================150
//	DESCRIPTION
//======================================================================================================================================================150

This is the CUDA version of the code.

The code calculates particle potential and relocation due to mutual forces between particles within a large 3D space. This space is 
divided into cubes, or large boxes, that are allocated to individual cluster nodes. The large box at each node is further divided into 
smaller cubes, called boxes. Each box (the home box) is surrounded by 26 neighbor boxes; home boxes at the boundaries of the particle space have 
fewer neighbors. Particles interact only with other particles within a cutoff radius, since particles at larger distances exert negligible 
forces. The box size is therefore chosen so that the cutoff radius of any particle in a home box does not extend beyond its neighbor boxes, which 
limits the reference space to a finite number of boxes.
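
The neighbor layout described above can be sketched in plain C. This is an illustrative sketch only, not the code shipped with the
benchmark: the function names (box_index, find_neighbors) and the flattened x-fastest indexing are assumptions made for the example.

```c
#include <assert.h>

/* Hypothetical helper: flatten 3D box coordinates into one index (x varies fastest). */
int box_index(int x, int y, int z, int boxes1d)
{
    return (z * boxes1d + y) * boxes1d + x;
}

/* Hypothetical helper: collect the indices of all boxes adjacent to the home
 * box at (x, y, z). The caller provides room for 26 entries. Returns the
 * number of neighbors written, which is less than 26 at the boundaries. */
int find_neighbors(int x, int y, int z, int boxes1d, int neighbors[26])
{
    assert(boxes1d > 0);
    assert(x >= 0 && x < boxes1d);
    assert(y >= 0 && y < boxes1d);
    assert(z >= 0 && z < boxes1d);

    int count = 0;
    for (int dz = -1; dz <= 1; dz++) {
        for (int dy = -1; dy <= 1; dy++) {
            for (int dx = -1; dx <= 1; dx++) {
                if (dx == 0 && dy == 0 && dz == 0) {
                    continue;               /* skip the home box itself */
                }
                int nx = x + dx, ny = y + dy, nz = z + dz;
                if (nx < 0 || nx >= boxes1d ||
                    ny < 0 || ny >= boxes1d ||
                    nz < 0 || nz >= boxes1d) {
                    continue;               /* outside the particle space */
                }
                neighbors[count++] = box_index(nx, ny, nz, boxes1d);
            }
        }
    }
    return count;
}
```

With boxes1d = 10, an interior box such as (5, 5, 5) gets all 26 neighbors, a box on a face gets 17, a box on an edge gets 11, and a
corner box gets 7.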

This code [1] was derived from the ddcMD application [2] by rewriting the front end and structuring it for parallelization. This code represents an MPI 
task that runs on a single cluster node. While the details of the code differ somewhat from the original, the code retains the structure of the 
MPI task in the original code. Since the rest of the MPI code is not included here, the application first emulates MPI partitioning of the particle space 
into boxes. Then, for every particle in the home box, the nested loop processes interactions first with other particles in the home box and then with 
particles in all neighbor boxes. The processing of each particle consists of a single stage of calculation that is enclosed in the innermost loop. The
nested loops in the application were parallelized in such a way that, at any point in time, a GPU warp/wavefront accesses adjacent memory locations. The
speedup depends on the number of boxes, the number of particles (fixed), and the actual calculation for each particle (fixed). The application is memory bound, and 
GPU speedup seems to saturate at about 16x when compared to a single-core CPU.
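
The loop nesting described above can be sketched serially in C. This is a rough illustration of the loop structure only, not the CUDA
kernel itself: the Particle and Box types, the function names, the explicit cutoff test, and the placeholder pair term q / (1 + r^2) are
all assumptions made for the example and are not taken from the benchmark source.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical data layout, chosen for the example only. */
typedef struct { double x, y, z, q; } Particle;
typedef struct { const Particle *p; size_t n; } Box;

/* Placeholder pair term (an assumption, not the benchmark's formula). */
double pair_potential(double qj, double r2)
{
    return qj / (1.0 + r2);
}

/* boxes[0] is the home box, boxes[1..nboxes-1] are its neighbor boxes.
 * out must hold one entry per particle in the home box. */
void accumulate_potential(const Box *boxes, size_t nboxes,
                          double cutoff, double *out)
{
    assert(nboxes >= 1 && cutoff > 0.0);

    const Box *home = &boxes[0];
    const double cutoff2 = cutoff * cutoff;

    for (size_t i = 0; i < home->n; i++) {            /* each home particle   */
        const Particle *pi = &home->p[i];
        double v = 0.0;
        for (size_t b = 0; b < nboxes; b++) {         /* home box first, then */
            for (size_t j = 0; j < boxes[b].n; j++) { /* every neighbor box   */
                const Particle *pj = &boxes[b].p[j];
                if (pj == pi) {
                    continue;                         /* no self-interaction  */
                }
                double dx = pi->x - pj->x;
                double dy = pi->y - pj->y;
                double dz = pi->z - pj->z;
                double r2 = dx * dx + dy * dy + dz * dz;
                if (r2 < cutoff2) {                   /* inside cutoff radius */
                    v += pair_potential(pj->q, r2);
                }
            }
        }
        out[i] = v;
    }
}
```

In a GPU version of such a loop nest, one plausible mapping is to give each home box to a thread block and each home particle to a thread,
so that neighboring threads read neighboring particle records.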

More information about the parallel version of this code can be found in:
[1] L. G. Szafaryn, T. Gamblin, B. deSupinski and K. Skadron. "Experiences with Achieving Portability across Heterogeneous Architectures." Submitted to
WOLFHPC workshop at 25th International Conference on Supercomputing (ICS). Tucson, AZ. 2010.
More about the original ddcMD application can be found in:
[2] F. H. Streitz, J. N. Glosli, M. V. Patel, B. Chan, R. K. Yates, B. R. de Supinski, J. Sexton, and J. A. Gunnels. "100+ TFlop Solidification Simulations 
on BlueGene/L." In Proceedings of the 2005 Supercomputing Conference (SC 05). Seattle, WA. 2005.

//======================================================================================================================================================150
//	USE
//======================================================================================================================================================150

The code takes the following parameter:
-boxes1d		(number of boxes in one dimension; the total number of boxes is boxes1d^3)

The code can be run as follows:
./lavaMD -boxes1d 10
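
As a rough sketch of how the -boxes1d parameter relates to the total box count, the C fragment below parses the flag and cubes the value.
The function names (parse_boxes1d, total_boxes) and the upper bound of 1000 are assumptions made for the example; the benchmark's own
argument handling may differ.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: return the integer that follows "-boxes1d" on the
 * command line, or -1 if the flag is missing or the value is not a positive
 * integer. The upper bound of 1000 is an arbitrary guard for this example
 * that keeps the cubed value within range of a 32-bit long. */
int parse_boxes1d(int argc, char **argv)
{
    for (int i = 1; i + 1 < argc; i++) {
        if (strcmp(argv[i], "-boxes1d") == 0) {
            char *end;
            long v = strtol(argv[i + 1], &end, 10);
            if (end == argv[i + 1] || *end != '\0' || v <= 0 || v > 1000) {
                return -1;
            }
            return (int)v;
        }
    }
    return -1;
}

/* Total number of boxes in the cubic particle space. */
long total_boxes(int boxes1d)
{
    assert(boxes1d > 0);
    return (long)boxes1d * boxes1d * boxes1d;
}
```

For the command line shown above, parse_boxes1d returns 10 and total_boxes(10) gives 1000 boxes.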

******Adjustable work group size*****
RD_WG_SIZE_0 or RD_WG_SIZE_0_0 

USAGE:
make clean
make KERNEL_DIM="-DRD_WG_SIZE_0=128"

######OUTPUT FOR VALIDATION########
USAGE:
make clean
make OUTPUT=Y