Solution
Consider a parallel implementation of matrix addition (A = B + C), where each matrix contains 1024 × 1024 elements of 8 bytes each. The page size is 4 Kbytes. The shared-memory multiprocessor system has 16 nodes in a cc-NUMA organization, each with a processor, a private cache, and a portion of the shared memory. The block size is 32 bytes. The matrices are stored in memory in row order (consecutive rows occupy consecutive locations in the memory space), and A, B, and C are stored one after the other. The algorithm is parallelized so that the first 1024/16 rows of the result matrix A are computed by the first node, the next 1024/16 rows by the second node, and so on.
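The sizes that follow from this setup can be worked out directly. The short Python sketch below does the arithmetic; it assumes (this is not stated in the problem) that matrix A starts on a page boundary, so that every row covers whole pages.

```python
# Inputs taken from the problem statement.
N = 1024                 # rows and columns per matrix
ELEM_BYTES = 8           # bytes per element
PAGE_BYTES = 4 * 1024    # page size
BLOCK_BYTES = 32         # block size
NODES = 16               # number of cc-NUMA nodes

# Derived quantities (assumes A begins on a page boundary).
row_bytes = N * ELEM_BYTES                         # bytes in one matrix row
pages_per_row = row_bytes // PAGE_BYTES            # pages spanned by one row
pages_per_matrix = N * pages_per_row               # pages in one matrix
blocks_per_page = PAGE_BYTES // BLOCK_BYTES        # blocks in one page
rows_per_node = N // NODES                         # rows of A computed per node
pages_per_node = rows_per_node * pages_per_row     # pages of each matrix per node

print(row_bytes, pages_per_row, pages_per_matrix)
print(blocks_per_page, rows_per_node, pages_per_node)
```

Under that alignment assumption, each row occupies 2 pages, each matrix occupies 2048 pages, and each node works on 128 consecutive pages of each of A, B, and C, with 128 blocks per page.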
(a) Assume that a round-robin static page-placement algorithm is used. How many accesses to the local versus remote memory will each node encounter?
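One way to approach (a) is to count accesses page by page. The sketch below is an illustration under assumptions that the problem leaves open: matrix A starts at page 0, page p is placed on node p mod 16, each block of a page causes exactly one memory access (a cold miss with no later reuse), and writes to A fetch the block just as reads of B and C do. The function name `round_robin_counts` is made up for this example.

```python
N, ELEM_BYTES, PAGE_BYTES, BLOCK_BYTES, NODES = 1024, 8, 4096, 32, 16

PAGES_PER_MATRIX = N * N * ELEM_BYTES // PAGE_BYTES   # 2048 pages per matrix
BLOCKS_PER_PAGE = PAGE_BYTES // BLOCK_BYTES           # 128 blocks per page
PAGES_PER_NODE = PAGES_PER_MATRIX // NODES            # 128 pages of each matrix per node


def round_robin_counts(node):
    """Return (local, remote) block-sized memory accesses for one node."""
    local = remote = 0
    for matrix in range(3):  # A, B, C are laid out one after the other
        first = matrix * PAGES_PER_MATRIX + node * PAGES_PER_NODE
        for page in range(first, first + PAGES_PER_NODE):
            if page % NODES == node:     # assumed home node: page number mod 16
                local += BLOCKS_PER_PAGE
            else:
                remote += BLOCKS_PER_PAGE
    return local, remote


# Every node sees the same split, because each node's 128-page slice of a
# matrix contains exactly 8 pages homed on each of the 16 nodes.
assert all(round_robin_counts(n) == round_robin_counts(0) for n in range(NODES))
print(round_robin_counts(0))  # (3072, 46080)
```

With these assumptions, 1/16 of each node's accesses are local (3072 blocks) and 15/16 are remote (46080 blocks). A different definition of "access" (for example, per element rather than per block) scales both numbers but leaves the 1:15 ratio unchanged.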
(b) Assume that a page-migration scheme is used. The cost of migrating a page is the same as that of 16 remote accesses, and a page is migrated when the cost of remote accesses to it exceeds twice the migration cost. How many accesses to the local versus remote memory will each node encounter?
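A possible way to reason about (b) is sketched below. It rests on several assumptions that the problem does not spell out: pages start in the round-robin placement of (a) with page p on node p mod 16, each page is touched by only one node (true if A starts on a page boundary), each block of a page causes one memory access, and a page migrates right after the remote access that pushes its remote count strictly above 2 × 16 = 32, so the 33rd access is still remote. The function name `migration_counts` is hypothetical.

```python
BLOCKS_PER_PAGE = 4096 // 32     # 128 block accesses per page
PAGES_TOUCHED = 3 * 128          # pages of A, B and C used by one node
LOCAL_PAGES = 3 * 8              # pages already local under round-robin placement
MIGRATION_COST = 16              # measured in remote accesses


def migration_counts(threshold=2 * MIGRATION_COST):
    """Return (local, remote, migrations) for one node.

    Assumption: a page migrates after its (threshold + 1)-th remote access,
    i.e. once the remote-access cost strictly exceeds the threshold.
    """
    remote_pages = PAGES_TOUCHED - LOCAL_PAGES
    if BLOCKS_PER_PAGE > threshold:
        remote_per_page = threshold + 1   # remote accesses before the page moves
        migrations = remote_pages
    else:
        remote_per_page = BLOCKS_PER_PAGE  # threshold never reached, page stays put
        migrations = 0
    remote = remote_pages * remote_per_page
    local = (LOCAL_PAGES * BLOCKS_PER_PAGE
             + remote_pages * (BLOCKS_PER_PAGE - remote_per_page))
    return local, remote, migrations


local, remote, migrations = migration_counts()
print(local, remote, migrations)        # 37272 11880 360
print(migrations * MIGRATION_COST)      # migration overhead in remote-access units: 5760
```

Under this reading, each of the 360 initially remote pages costs 33 remote accesses before it moves, and its remaining 95 block accesses are local. If the page is instead taken to migrate after exactly 32 remote accesses, calling `migration_counts(threshold=31)` gives the corresponding split.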