-
Notifications
You must be signed in to change notification settings - Fork 0
/
sparse-matmul.html
604 lines (510 loc) · 24.2 KB
/
sparse-matmul.html
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en" xml:lang="en">
<head>
<!-- 2016-06-05 Sun 09:20 -->
<meta http-equiv="Content-Type" content="text/html;charset=utf-8" />
<meta name="viewport" content="width=device-width, initial-scale=1" />
<title>Communication-Avoiding Parallel Sparse-Dense Matrix Multiplication — The MPI Programming Assignment</title>
<meta name="generator" content="Org-mode" />
<meta name="author" content="Krzysztof Rzadca" />
<style type="text/css">
<!--/*--><![CDATA[/*><!--*/
.title { text-align: center;
margin-bottom: .2em; }
.subtitle { text-align: center;
font-size: medium;
font-weight: bold;
margin-top:0; }
.todo { font-family: monospace; color: red; }
.done { font-family: monospace; color: green; }
.priority { font-family: monospace; color: orange; }
.tag { background-color: #eee; font-family: monospace;
padding: 2px; font-size: 80%; font-weight: normal; }
.timestamp { color: #bebebe; }
.timestamp-kwd { color: #5f9ea0; }
.org-right { margin-left: auto; margin-right: 0px; text-align: right; }
.org-left { margin-left: 0px; margin-right: auto; text-align: left; }
.org-center { margin-left: auto; margin-right: auto; text-align: center; }
.underline { text-decoration: underline; }
#postamble p, #preamble p { font-size: 90%; margin: .2em; }
p.verse { margin-left: 3%; }
pre {
border: 1px solid #ccc;
box-shadow: 3px 3px 3px #eee;
padding: 8pt;
font-family: monospace;
overflow: auto;
margin: 1.2em;
}
pre.src {
position: relative;
overflow: visible;
padding-top: 1.2em;
}
pre.src:before {
display: none;
position: absolute;
background-color: white;
top: -10px;
right: 10px;
padding: 3px;
border: 1px solid black;
}
pre.src:hover:before { display: inline;}
pre.src-sh:before { content: 'sh'; }
pre.src-bash:before { content: 'sh'; }
pre.src-emacs-lisp:before { content: 'Emacs Lisp'; }
pre.src-R:before { content: 'R'; }
pre.src-perl:before { content: 'Perl'; }
pre.src-java:before { content: 'Java'; }
pre.src-sql:before { content: 'SQL'; }
table { border-collapse:collapse; }
caption.t-above { caption-side: top; }
caption.t-bottom { caption-side: bottom; }
td, th { vertical-align:top; }
th.org-right { text-align: center; }
th.org-left { text-align: center; }
th.org-center { text-align: center; }
td.org-right { text-align: right; }
td.org-left { text-align: left; }
td.org-center { text-align: center; }
dt { font-weight: bold; }
.footpara { display: inline; }
.footdef { margin-bottom: 1em; }
.figure { padding: 1em; }
.figure p { text-align: center; }
.inlinetask {
padding: 10px;
border: 2px solid gray;
margin: 10px;
background: #ffffcc;
}
#org-div-home-and-up
{ text-align: right; font-size: 70%; white-space: nowrap; }
textarea { overflow-x: auto; }
.linenr { font-size: smaller }
.code-highlighted { background-color: #ffff00; }
.org-info-js_info-navigation { border-style: none; }
#org-info-js_console-label
{ font-size: 10px; font-weight: bold; white-space: nowrap; }
.org-info-js_search-highlight
{ background-color: #ffff00; color: #000000; font-weight: bold; }
/*]]>*/-->
</style>
<script type="text/javascript">
/*
@licstart The following is the entire license notice for the
JavaScript code in this tag.
Copyright (C) 2012-2013 Free Software Foundation, Inc.
The JavaScript code in this tag is free software: you can
redistribute it and/or modify it under the terms of the GNU
General Public License (GNU GPL) as published by the Free Software
Foundation, either version 3 of the License, or (at your option)
any later version. The code is distributed WITHOUT ANY WARRANTY;
without even the implied warranty of MERCHANTABILITY or FITNESS
FOR A PARTICULAR PURPOSE. See the GNU GPL for more details.
As additional permission under GNU GPL version 3 section 7, you
may distribute non-source (e.g., minimized or compacted) forms of
that code without the copy of the GNU GPL normally required by
section 4, provided you include this license notice and a URL
through which recipients can access the Corresponding Source.
@licend The above is the entire license notice
for the JavaScript code in this tag.
*/
<!--/*--><![CDATA[/*><!--*/
function CodeHighlightOn(elem, id)
{
var target = document.getElementById(id);
if(null != target) {
elem.cacheClassElem = elem.className;
elem.cacheClassTarget = target.className;
target.className = "code-highlighted";
elem.className = "code-highlighted";
}
}
function CodeHighlightOff(elem, id)
{
var target = document.getElementById(id);
if(elem.cacheClassElem)
elem.className = elem.cacheClassElem;
if(elem.cacheClassTarget)
target.className = elem.cacheClassTarget;
}
/*]]>*///-->
</script>
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
displayAlign: "center",
displayIndent: "0em",
"HTML-CSS": { scale: 100,
linebreaks: { automatic: "false" },
webFont: "TeX"
},
SVG: {scale: 100,
linebreaks: { automatic: "false" },
font: "TeX"},
NativeMML: {scale: 100},
TeX: { equationNumbers: {autoNumber: "AMS"},
MultLineWidth: "85%",
TagSide: "right",
TagIndent: ".8em"
}
});
</script>
<script type="text/javascript"
src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS_HTML"></script>
</head>
<body>
<div id="content">
<h1 class="title">Communication-Avoiding Parallel Sparse-Dense Matrix Multiplication — The MPI Programming Assignment</h1>
<div id="table-of-contents">
<h2>Table of Contents</h2>
<div id="text-table-of-contents">
<ul>
<li><a href="#orgheadline1">1. Introduction</a></li>
<li><a href="#orgheadline2">2. Communication-Avoding Sparse-Dense Matrix Multiplication</a></li>
<li><a href="#orgheadline3">3. Specific requirements</a></li>
<li><a href="#orgheadline4">4. Input and output</a></li>
<li><a href="#orgheadline5">5. Solution content</a></li>
<li><a href="#orgheadline6">6. Scoring</a></li>
<li><a href="#orgheadline7">7. Additional materials</a></li>
<li><a href="#orgheadline8">8. FAQ</a></li>
</ul>
</div>
</div>
<p>
Due date: 8th June, 23:55 <del>5th June, 23:59</del>.
</p>
<p>
Version of this document: 10
</p>
<p>
changelog:
</p>
<ul class="org-ul">
<li>10: FAQ</li>
<li>9: FAQ</li>
<li>8: FAQ</li>
<li>7: FAQ, example tests renamed</li>
<li>6: FAQ</li>
<li>5: FAQ, example tests slightly cleaned (2 unneeded files deleted)</li>
<li>4: FAQ, another error in the Koanantakool paper, how matrix A should be sent by the coordinator;</li>
<li>3: FAQ, sample tests, performance guidelines;</li>
<li>2: FAQ, matrices A,B are square; slightly modified template;</li>
</ul>
<p>
This document last modified: 5/06/2016 16:30
</p>
<div id="outline-container-orgheadline1" class="outline-2">
<h2 id="orgheadline1"><span class="section-number-2">1</span> Introduction</h2>
<div class="outline-text-2" id="text-1">
<p>
As in a typical supercomputer communication takes orders of magnitude more time than computation, there is a growing new field in high-performance algorithms: communication-avoidance. These algorithms trade communication for redundant computation or increased memory usage.
</p>
</div>
</div>
<div id="outline-container-orgheadline2" class="outline-2">
<h2 id="orgheadline2"><span class="section-number-2">2</span> Communication-Avoding Sparse-Dense Matrix Multiplication</h2>
<div class="outline-text-2" id="text-2">
<p>
Your goal is to implement two parallel algorithms for multiplying a sparse matrix A by a dense matrix B: 1.5D Blocked Column A; and 1.5D Blocked Inner ABC - see the paper by Koanantakool et al.
The main idea of both algorithms is to replicate (redundantly) the data, and then to do less communication rounds, than if the data were not replicated.
Both algorithms organize processes into "replication" groups of size c (c is a parameter of the algorithm).
The Column A algorithm replicates just the sparse matrix A. The Inner ABC algorithm replicates all the matrices: the result C is gathered from the parts stored in the replication group.
</p>
<p>
Warning: there is an error in the description of 1.5D Blocked Inner ABC in Koanantakool et al's paper. The paper defines the C's blocks to be computed by process P(j,l) as \(C^{p/c \times p/c}(k,j) = A^{p/c \times 1}(k,:) B^{1 \times p/c} (:,j)\) where \(lq \leq k < (l+1)q\), however, this contradicts the description that follows and the example from Fig 3.b. The correct indexing is: \(k \mod p/c\) such that \(lq +j \leq k < (l+1)q + j\).
</p>
<p>
Warning: There is a similar error in the description of Column A: P(j) stores \(A^{1 \times p/c}(:,j \mod p/c)\) (which is consistent with Fig. 2b), and not \(A^{1 \times p/c}(:,j)\) (thanks: Michał Szostek).
</p>
</div>
</div>
<div id="outline-container-orgheadline3" class="outline-2">
<h2 id="orgheadline3"><span class="section-number-2">3</span> Specific requirements</h2>
<div class="outline-text-2" id="text-3">
<p>
Your goal is to implement multiple matrix multiplication, i.e., compute \(C = A ( A ( A ... ( A B ) ) )\). You cannot change the order of multiplications. The number of multiplications is given as a program parameter.
</p>
<p>
Your implementation must start from a data distribution for c = 1 (i.e., as if there is no replication). Using a generator we supply, processes generate the dense matrix B in parallel (our generator is stateless, so it might be used in parallel by multiple MPI processes; however, each element of the matrix must be generated exactly once).
</p>
<p>
Process 0 loads the sparse matrix A from a CSR file (see bibliography for the description of the format) and then sends it to other processes. Each process should receive only a part of the matrix that it will store for data distribution for c = 1 (the coordinator should not send redundant data).
</p>
<p>
Only after this initial data distribution, processes should contact their peers in replication groups and exchange their parts of matrices.
</p>
<p>
Your solution must use only MPI. Do not use any other parallel/distributed programming libraries (like cilk, pthreads, or OpenMP).
</p>
<p>
Do not use any libraries for local (inside a process) matrix multiplication.
</p>
<p>
Assume that A and B are square matrices.
</p>
<p>
Assume that, for performance testing, the number of rows and columns of A is much larger (at least thousands) than the number of MPI processes (at most hundreds). However, you cannot assume that the number of MPI processes divides the number of rows/columns (but you can extend matrices with 0 rows or columns, as long as they do not get printed in the final result).
</p>
</div>
</div>
<div id="outline-container-orgheadline4" class="outline-2">
<h2 id="orgheadline4"><span class="section-number-2">4</span> Input and output</h2>
<div class="outline-text-2" id="text-4">
<p>
Programs will be tested automatically. Please stick to the format below.
</p>
<p>
Your program will be run using the following instructions:
</p>
<p>
<code>cd xx123456; make clean; make</code>
</p>
<p>
<code>mpirun -np NP ./matrixmul -f sparse_matrix_file -s seed_for_dense_matrix -c repl_group_size -e exponent [-g ge_value] [-v] [-i]</code>
</p>
<p>
where:
</p>
<ul class="org-ul">
<li>NP is the number of processes (assume that NP is a multiplier of c);</li>
<li><code>sparse_matrix_file</code> is a CSR file storing the sparse matrix A. The first row contains 4 integers: the number of rows, the number of columns, the total number of non-zero elements, and the maximum number of non-zero elements in each row. The following 3 rows specify entries (array A in wikipedia's description); row offsets (array IA); and column indices (array JA).</li>
<li><code>-i</code> toggles the Inner algorithm (otherwise the Column algorithm is used);</li>
<li><code>-v</code> prints the matrix C (the multiplication result) in a row-major order: the first line specifies the number of rows and the number of columns of the result; i+1-th line is the i-th row of the matrix;</li>
<li><code>-c repl_group_size</code> specifies the number of processes in each replication group (i.e., parameter c from Koanantakool et al);</li>
<li><code>-e exponent</code> the number of multiplications to do, e.g., for <code>-e 3</code> your program should compute \(C = A ( A ( A B ) )\)</li>
<li><code>-g ge_value</code> prints the number of elements in C greater than or equal to the ~ge<sub>value</sub>_.</li>
</ul>
<p>
Do not print anything other than the matrix C (if <code>-v</code> is used) or a single integer (if <code>-g</code> is used) on stdout.
</p>
</div>
</div>
<div id="outline-container-orgheadline5" class="outline-2">
<h2 id="orgheadline5"><span class="section-number-2">5</span> Solution content</h2>
<div class="outline-text-2" id="text-5">
<p>
Please send us a single <code>.zip</code> file containing a single directory with your login (<code>ab123456</code>); the directory has at least the following files:
</p>
<ul class="org-ul">
<li><code>densematgen.h</code>: our generator. This file cannot be modified.</li>
<li><code>densematgen.c</code>: our generator. this file cannot be modified. For tests, we might use a different implementation of a generator (but it will be stateless);</li>
<li><code>matrixmul.c</code>: the main file of your solution</li>
<li><code>Makefile</code></li>
<li><code>report.pdf</code>: a report describing your implementation. Please describe the optimizations you implemented.</li>
</ul>
</div>
</div>
<div id="outline-container-orgheadline6" class="outline-2">
<h2 id="orgheadline6"><span class="section-number-2">6</span> Scoring</h2>
<div class="outline-text-2" id="text-6">
<ul class="org-ul">
<li>correct, parallel MPI implementation of the Inner algorithm: 5 points;</li>
<li>correct, parallel MPI implementation of the Column algorithm: 6 points;</li>
<li>report: 3 points (incorrect implementations do not get these points);</li>
<li>performance: 6 points (incorrect implementations do not get these points).</li>
</ul>
<p>
We will score correctness on our test data; we take into account the floating point errors.
</p>
<p>
We will score whether and how your implementation uses advanced MPI operations, like asynchronous messages, collectives, custom datatypes, custom communicators (hint: for communicating inside a replication group).
</p>
<p>
If you're late, your score will be reduced by 1 point for every 12 hours (i.e.: if you're late by 2h, we subtract 1 point from your score; if you're late by 25h, we subtract 3 points).
</p>
<p>
Due to problems with accessing ICM submission system, this year we are unable to give you access to a supercomputer. To optimize performance, please follow the guidelines given during the lecture, the labs and from Traff et al paper.
</p>
</div>
</div>
<div id="outline-container-orgheadline7" class="outline-2">
<h2 id="orgheadline7"><span class="section-number-2">7</span> Additional materials</h2>
<div class="outline-text-2" id="text-7">
<p>
Please do not use any source codes of matrix multiplication programs. We recommend reading the following documents:
</p>
<ul class="org-ul">
<li>P.Koanantakool et al, Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication, IPDPS 2016, <a href="http://www.eecs.berkeley.edu/~penpornk/spdm3_ipdps16.pdf">http://www.eecs.berkeley.edu/~penpornk/spdm3_ipdps16.pdf</a></li>
<li>Jesper Larsson Traff, William D. Gropp, and Rajeev Thakur, Self-Consistent MPI Performance Guidelines <a href="http://www-unix.mcs.anl.gov/~thakur/papers/tpds-consistency.pdf">http://www-unix.mcs.anl.gov/~thakur/papers/tpds-consistency.pdf</a></li>
<li>CSR matrix format: <a href="https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_.28CSR.2C_CRS_or_Yale_format.29">https://en.wikipedia.org/wiki/Sparse_matrix#Compressed_sparse_row_.28CSR.2C_CRS_or_Yale_format.29</a></li>
<li>template <a href="http://mimuw.edu.pl/~krzadca/ab123456.zip">http://mimuw.edu.pl/~krzadca/ab123456.zip</a></li>
<li>some tests <a href="http://mimuw.edu.pl/~krzadca/exported_tests.zip">http://mimuw.edu.pl/~krzadca/exported_tests.zip</a></li>
</ul>
</div>
</div>
<div id="outline-container-orgheadline8" class="outline-2">
<h2 id="orgheadline8"><span class="section-number-2">8</span> FAQ</h2>
<div class="outline-text-2" id="text-8">
<p>
(thanks: Jakub Sygnowski, Michał Szostek, Hubert Tarasiuk, Marcel Zięba, Tomasz Zakrzewski, Artur Kozak, Aleksander Kramarz, Paweł Krawczyk, Weronika Pecio, Mateusz Kitlas, Łukasz Gajowiak, Klaudia Algiz, Krzysztof Piecuch)
</p>
<ul class="org-ul">
<li>What language can I use?</li>
</ul>
<p>
C or C++.
</p>
<ul class="org-ul">
<li>Can I use <a href="https://github.com/google/googletest/tree/master/googletest">https://github.com/google/googletest/tree/master/googletest</a> for testing? Should I include such tests?</li>
</ul>
<p>
You can use external testing libraries, as long as your makefile will compile on students.mimuw.edu.pl . You can include such tests in your final solution (even if they use pthreads). However, your code can't use pthreads in implementation of the algorithms.
</p>
<ul class="org-ul">
<li>Can I modify the template?</li>
</ul>
<p>
Yes, you can! In particular, if you write in C++, you can write your own main module (and modify Makespan as you neeed).
</p>
<ul class="org-ul">
<li>What datatype should I use for computation?</li>
</ul>
<p>
Use doubles.
</p>
<ul class="org-ul">
<li>Do you really want us to, first, scatter A into as many unique pieces as there are processes, and only after that (if \(c>1\)) construct redundant A parts by communicating among processes? Why can't the coordinator send A partitioned for correct c?</li>
</ul>
<p>
Matrix multiplication (and other distributed) algorithms commonly assume that the data is partioned evenly among processes before the algorithm starts (see the first lecture on alpha-beta efficiency model). The communication-avoiding algorithms reorganize data and thus add new communication (before the multiplication really begins). We want to be able to measure the time needed for this additional communication, which is possible only when the algorithm firstly distributes the data for c=1 (thus, as if there was no communication-avoidance), and only after that reorganizes the data.
</p>
<ul class="org-ul">
<li>If I use C++, can I rename the main file of my solution to matrixmul.cpp or matrixmul.cc?</li>
</ul>
<p>
Yes, you can.
</p>
<ul class="org-ul">
<li>How are the example tests organized? — see the bottom of the page, the answer below is no longer valid</li>
</ul>
<p>
<code>result_X_000Y_Z</code> is a result of multiplying <code>sparse05_000Y_Z</code> with <code>matrix01_000Y_A</code> X times (i.e., with <code>-e X</code>); A is the (Z+1)th ordered seed of the dense matrix. For instance, in <code>result_2_00010_002</code>, we multiply <code>sparse05_00010_002</code> with <code>matrix01_00010_291</code> and then mutliply the result again by <code>sparse05_00010_002</code>, or, in other words, <code>./matrixmul -f sparse05_0010_002 -s 291 -e 2 -v</code>.
</p>
<ul class="org-ul">
<li>Can process 0 load the whole sparse matrix A, and only after that partition it and send parts to other processes?</li>
</ul>
<p>
Yes.
</p>
<ul class="org-ul">
<li>What is the c parameter? Number of processes in the replication group, or number of replication groups? And what is a replication group anyway?</li>
</ul>
<p>
In this document, a replication group is a group of processes storing the same data (after replication phase, before the computation phase). In the replication phase, each process contacts \((c-1)\) other processes; these processes combine the data they got from the initial data distribution.
</p>
<ul class="org-ul">
<li>Can I really code in C++? They say it's slower…</li>
</ul>
<p>
Yes, you can code in C++. In any language, make sure your solution is as optimized as possible (regarding both MPI usage and sequential code; however, for this assignment, MPI usage will count more than sequential optimization).
</p>
<ul class="org-ul">
<li>Can I assume that the result C will fit in the memory of a single node?</li>
</ul>
<p>
In general: if <code>-g</code> or no output option is used, you can't assume the result will fit in a single node. But you can assume that if <code>-v</code> is used, C will fit in a single node.
</p>
<ul class="org-ul">
<li>Can I assume that, when InnerABC is used, \(p\) is a multiply of \(c^2\)</li>
</ul>
<p>
Yes, you can.
</p>
<ul class="org-ul">
<li>Will you run our code on a cluster?</li>
</ul>
<p>
Most probably, yes. But the performance score will also (mostly) take into account how your code uses MPI (see the scoring section above).
</p>
<ul class="org-ul">
<li>How should the processes know the size of matrix A?</li>
</ul>
<p>
Process 0 can broadcast the size of A. In general, you can add messages to your implementation.
</p>
<ul class="org-ul">
<li>Should I add a barrier before <code>comm_start = MPI_Wtime();</code>?</li>
</ul>
<p>
Yes, the measurement will be more precise with a barrier before measuring the communication time.
</p>
<ul class="org-ul">
<li>Can I implement blocked row instead of blocked column?</li>
</ul>
<p>
No, please stick to the column algorithm.
</p>
<ul class="org-ul">
<li>Can I divide the A matrix into uneven parts to optimize the distribution of non-zero elements?</li>
</ul>
<p>
You can assume that in A, for each row, the columns storing non-zero elements of A are generated from a uniform distribution, thus, in expectation, each column of A has the same number of non-zero elements. But you can optimize the number of columns assigned to processes - however, if you do so, please make it an optional compilation option (that is switched on by default, but can be switched off).
</p>
<ul class="org-ul">
<li>How are the example tests organized?</li>
</ul>
<p>
<code>result_X_000Y_Z_A</code> is a result of multiplying <code>sparse05_000Y_Z</code> with <code>matrix01_000Y_A</code> X times (i.e., with <code>-e X</code>). For instance, in <code>result_2_00010_002_00291</code>, we multiply <code>sparse05_00010_002</code> with <code>matrix01_00010_291</code> and then mutliply the result again by <code>sparse05_00010_002</code>, or, in other words, <code>./matrixmul -f sparse05_0010_002 -s 291 -e 2 -v</code>.
</p>
<ul class="org-ul">
<li>Can a process temporary store a few (two, three) blocks of matrix A (e.g., when the matrix is shifted)?</li>
</ul>
<p>
Yes, it can.
</p>
<ul class="org-ul">
<li>How should the output be formatted?</li>
</ul>
<p>
See the description of <code>-v</code> in the Input/Output section. Additional instructions: white-spaces are not important (values can be separated by a single space); field formatting is not important; you should use a standard format for floating point numbers (eg. <code>12345.67890</code>) with at least 5 numbers after the dot.
</p>
<ul class="org-ul">
<li>Can I destroy B during computation?</li>
</ul>
<p>
Yes, you can (as long as the result is correct).
</p>
<ul class="org-ul">
<li>In Inner, at the end, should the processes send their parts of matrix to process in layer 0, or can they directly send the parts to the coordinator (or just the counts for <code>-g</code>)?</li>
</ul>
<p>
For fair measurements, please reduce first to layer 0, and only then send to the coordinator.
</p>
<ul class="org-ul">
<li>Can the coordinator allocate memory for 2|A|?</li>
</ul>
<p>
Yes, it can.
</p>
<ul class="org-ul">
<li>Can I use Boost.MPI?</li>
</ul>
<p>
Yes, you can.
</p>
<ul class="org-ul">
<li>Can we assume that the cluster nodes will use the same binary representation of numbers?</li>
</ul>
<p>
Yes, you can.
</p>
<ul class="org-ul">
<li>Can I use a different matrix representation for the sparse matrix A?</li>
</ul>
<p>
The input is in CSR, but you can use any internal (i.e., in memory) representation, including CSR. The representation has to be sparse, i.e., A cannot use \(O(n^2)\) memory.
</p>
</div>
</div>
</div>
<div id="postamble" class="status">
<p class="author">Author: Krzysztof Rzadca</p>
<p class="date">Created: 2016-06-05 Sun 09:20</p>
<p class="validation"><a href="http://validator.w3.org/check?uri=referer">Validate</a></p>
</div>
</body>
</html>