We want to compute the sum of two vectors x and y.
\begin{flalign*}
x + y &= (x_{0},\,x_{1},\,\dots,\,x_{n-1})+(y_{0},\,y_{1},\,\dots,\,y_{n-1}) \\
&= (x_{0}+y_{0},\,\dots,\,x_{n-1}+y_{n-1}) \\
&= (z_{0},\,\dots,\,z_{n-1}) \\
&= z
\end{flalign*}
The serial implementation of a vector sum is:
void vector_sum(double x[], double y[], double z[], int n) {
    int i;
    for (i = 0; i < n; i++)   // element-wise sum
        z[i] = x[i] + y[i];
}
We can parallelize it (with data parallelism) by dividing the vectors among the processes:
void parallel_vector_sum(
        double local_x[],   // in
        double local_y[],   // in
        double local_z[],   // out
        int    local_n) {   // in
    int local_i;
    for (local_i = 0; local_i < local_n; local_i++)
        local_z[local_i] = local_x[local_i] + local_y[local_i];
}
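As a rough sketch of how this could be driven, rank 0 might scatter x and y, every process would call parallel_vector_sum on its block, and the partial results would be gathered back into z on rank 0; this assumes n is divisible by comm_sz, that x, y, z are allocated on rank 0, and that the local_* buffers (of length local_n) exist on every process (the variable names are illustrative):

// sketch of a possible call site
int comm_sz, local_n;
MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);
local_n = n / comm_sz;   // assumes n % comm_sz == 0

// distribute the input vectors block-wise
MPI_Scatter(x, local_n, MPI_DOUBLE, local_x, local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Scatter(y, local_n, MPI_DOUBLE, local_y, local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);

parallel_vector_sum(local_x, local_y, local_z, local_n);

// collect the partial sums back into z on rank 0
MPI_Gather(local_z, local_n, MPI_DOUBLE, z, local_n, MPI_DOUBLE, 0, MPI_COMM_WORLD);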
matrices
Matrices are stored in memory row by row; if each row is a separate array (for example, an array of row pointers), then summing matrices element-wise across processes with MPI_Reduce requires one call per row:

for (int i = 0; i < num_rows; i++) {
    MPI_Reduce(a[i], recvbuf[i], num_cols, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);
}

(since each row of the matrix is its own array, a reduce has to be called once per row)
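For context, one layout that forces the per-row reduce above is the classic array-of-row-pointers allocation; a sketch, with names matching the loop above but otherwise illustrative:

// each row is malloc'd separately, so the rows are not contiguous in memory
int **a       = malloc(num_rows * sizeof(int *));
int **recvbuf = malloc(num_rows * sizeof(int *));
for (int i = 0; i < num_rows; i++) {
    a[i]       = malloc(num_cols * sizeof(int));
    recvbuf[i] = malloc(num_cols * sizeof(int));
}
// ... fill a[i][j] on every process, then reduce row by row as above ...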
linearizing matrices
We can also linearize matrices by allocating them as a single contiguous block (one long row) - we have to index them differently, but collectives become significantly easier.
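For instance, if the matrices of the previous example are allocated as contiguous blocks, the row-by-row loop collapses into a single collective call (a sketch under that assumption):

int *a       = malloc(num_rows * num_cols * sizeof(int));   // linearized matrix
int *recvbuf = malloc(num_rows * num_cols * sizeof(int));
// ... fill a[i*num_cols + j] on every process ...
MPI_Reduce(a, recvbuf, num_rows * num_cols, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);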
void Mat_vect_mult(
        double A[],   // in  (matrix, linearized)
        double x[],   // in  (vector)
        double y[],   // out (result of the multiplication)
        int    m,     // in  (rows)
        int    n) {   // in  (columns)
    int i, j;
    for (i = 0; i < m; i++) {
        y[i] = 0.0;                       // clean slate in y
        for (j = 0; j < n; j++)
            y[i] += A[i*n + j] * x[j];    // product
    }
}
How do we parallelize this?
1. broadcast the vector x from rank 0 to all other processes
2. scatter the rows of the matrix from rank 0 to the other processes
3. each process computes a subset of the elements of the resulting vector y
4. gather the final vector y to rank 0
5. broadcast y from rank 0 to all other processes
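In terms of MPI calls, steps 1, 2, 4 and 5 map onto collectives roughly as follows; this is a sketch that assumes A lives on rank 0, every process has buffers x (length n), y (length m), local_A (local_m*n doubles) and local_y (local_m doubles), and m is divisible by comm_sz so that local_m = m / comm_sz:

MPI_Bcast(x, n, MPI_DOUBLE, 0, MPI_COMM_WORLD);                     // step 1
MPI_Scatter(A, local_m*n, MPI_DOUBLE,                               // step 2
            local_A, local_m*n, MPI_DOUBLE, 0, MPI_COMM_WORLD);
// step 3: each process multiplies its local_m rows of A by x into local_y
MPI_Gather(local_y, local_m, MPI_DOUBLE,                            // step 4
           y, local_m, MPI_DOUBLE, 0, MPI_COMM_WORLD);
MPI_Bcast(y, m, MPI_DOUBLE, 0, MPI_COMM_WORLD);                     // step 5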
reading a matrix from stdin and scattering it
void Read_matrix(
        char   prompt[],   // in
        double local_A[],  // out
        int    m,          // in
        int    local_m,    // in
        int    n,          // in
        int    my_rank,    // in
        MPI_Comm comm) {   // in
    double* A = NULL;
    int i, j;
    // only rank 0 will allocate and use A
    if (my_rank == 0) {
        // the matrix is stored as a contiguous 1D array
        A = malloc(m*n*sizeof(double));
        printf("Enter the matrix %s\n", prompt);
        for (i = 0; i < m; i++)
            for (j = 0; j < n; j++)
                // (the i,j coordinates are converted into a 1D index)
                scanf("%lf", &A[i*n+j]);
        // scatter local_m rows to each process
        MPI_Scatter(A, local_m*n, MPI_DOUBLE,
                    local_A, local_m*n, MPI_DOUBLE, 0, comm);
        free(A);
    } else {
        // the other processes receive their rows of the matrix
        MPI_Scatter(A, local_m*n, MPI_DOUBLE,
                    local_A, local_m*n, MPI_DOUBLE, 0, comm);
    }
}
We can use MPI_Allgather to perform steps 4 and 5 at once (a gather whose result ends up on every process):
void Mat_vect_mult(
        double local_A[],  // in  (local_m rows of A)
        double local_x[],  // in  (local block of x)
        double y[],        // out (full result vector, length local_m * comm_sz = m)
        int    local_m,    // in
        int    n,          // in
        int    local_n,    // in
        MPI_Comm comm) {   // in
    double* x       = malloc(n * sizeof(double));
    double* local_y = malloc(local_m * sizeof(double));
    int local_i, j;
    // gather x from the different processes
    // (all the processes will call Mat_vect_mult)
    MPI_Allgather(local_x, local_n, MPI_DOUBLE, x, local_n, MPI_DOUBLE, comm);
    for (local_i = 0; local_i < local_m; local_i++) {
        local_y[local_i] = 0.0;
        for (j = 0; j < n; j++)   // only local_m rows are used
            local_y[local_i] += local_A[local_i*n + j] * x[j];
    }
    // allgather the local results into y on every process (steps 4 and 5)
    MPI_Allgather(local_y, local_m, MPI_DOUBLE, y, local_m, MPI_DOUBLE, comm);
    free(x);
    free(local_y);
}
(so, in the main function, one would call Read_matrix first and then Mat_vect_mult)
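Putting it together, a possible main could look like the sketch below; it assumes m and n are known on every process and divisible by comm_sz, and that local_x has already been filled (for example by scattering a vector read on rank 0, not shown here):

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char* argv[]) {
    int my_rank, comm_sz;
    int m = 8, n = 8;                      // illustrative sizes, divisible by comm_sz
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &comm_sz);

    int local_m = m / comm_sz;
    int local_n = n / comm_sz;
    double* local_A = malloc(local_m * n * sizeof(double));
    double* local_x = malloc(local_n * sizeof(double));
    double* y       = malloc(m * sizeof(double));

    Read_matrix("A", local_A, m, local_m, n, my_rank, MPI_COMM_WORLD);
    // ... fill local_x, e.g. by scattering a vector read on rank 0 (not shown) ...
    Mat_vect_mult(local_A, local_x, y, local_m, n, local_n, MPI_COMM_WORLD);

    if (my_rank == 0)                       // y is available on every process
        for (int i = 0; i < m; i++)
            printf("%f\n", y[i]);

    free(local_A); free(local_x); free(y);
    MPI_Finalize();
    return 0;
}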