===================================================================
RCS file: /home/cvs/OpenXM/doc/ascm2001p/homogeneous-network.tex,v
retrieving revision 1.1
retrieving revision 1.2
diff -u -p -r1.1 -r1.2
--- OpenXM/doc/ascm2001p/homogeneous-network.tex	2001/06/19 07:32:58	1.1
+++ OpenXM/doc/ascm2001p/homogeneous-network.tex	2001/06/20 01:43:12	1.2
@@ -1,4 +1,4 @@
-% $OpenXM$
+% $OpenXM: OpenXM/doc/ascm2001p/homogeneous-network.tex,v 1.1 2001/06/19 07:32:58 noro Exp $
 \subsection{Distributed computation with homogeneous servers}
 \label{section:homog}
@@ -9,129 +9,33 @@ not include communication between servers, one cannot
 expect the maximal parallel speedup. However, it is possible to execute
 several types of distributed computation as follows.
 
-\subsubsection{Product of univariate polynomials}
-
-Shoup \cite{Shoup} showed that the product of univariate polynomials
-with large degrees and large coefficients can be computed efficiently
-by FFT over small finite fields and Chinese remainder theorem,
-which can be easily parallelized.
-%
-%\begin{tabbing}
-%Input :\= $f_1, f_2 \in {\bf Z}[x]$ such that $deg(f_1), deg(f_2) < 2^M$\\
-%Output : $f = f_1f_2$ \\
-%$P \leftarrow$ \= $\{m_1,\cdots,m_N\}$ where $m_i$ is an odd prime, \\
-%\> $2^{M+1}|m_i-1$ and $m=\prod m_i $ is sufficiently large. \\
-%Separate $P$ into disjoint subsets $P_1, \cdots, P_L$.\\
-%for \= $j=1$ to $L$ $M_j \leftarrow \prod_{m_i\in P_j} m_i$\\
-%Compute $F_j$ such that $F_j \equiv f_1f_2 \bmod M_j$\\
-%\> and $F_j \equiv 0 \bmod m/M_j$ in parallel.\\
-%\> (The product is computed by FFT.)\\
-%return $\phi_m(\sum F_j)$\\
-%(For $a \in {\bf Z}$, $\phi_m(a) \in (-m/2,m/2)$ and $\phi_m(a)\equiv a \bmod m$)
-%\end{tabbing}
-%
-Figure \ref{speedup}
-shows the speedup factor under the above distributed computation
-on Risa/Asir. For each $n$, two polynomials of degree $n$
-with 3000bit coefficients are generated and the product is computed.
-The machine is FUJITSU AP3000,
-a cluster of Sun workstations connected with a high speed network
-and MPI over the network is used to implement OpenXM.
-\begin{figure}[htbp]
-\epsfxsize=10cm
-\epsffile{speedup.ps}
-\caption{Speedup factor}
-\label{speedup}
-\end{figure}
-If the number of servers is $L$ and the inputs are fixed, then the cost to
-compute the products modulo some integers in parallel is $O(1/L)$,
-whereas the cost
-to send and receive polynomials is $O(L)$ if {\tt ox\_push\_cmo()} and
-{\tt ox\_pop\_cmo()} are repeatedly applied on the client.
-Therefore the speedup is limited and the upper bound of
-the speedup factor depends on the ratio of
-the computational cost and the communication cost for each unit operation.
-Figure \ref{speedup} shows that
-the speedup is satisfactory if the degree is large and $L$
-is not large, say, up to 10 under the above environment.
-If OpenXM provides collective operations for broadcast and reduction
-such as {\tt MPI\_Bcast} and {\tt MPI\_Reduce} respectively, the cost of
-broadcasting the inputs and gathering the results on the servers
-may be reduced to $O(\log_2L)$
-and we can expect better results in such a case. In order to implement
-such operations we need new specifications for inter-sever communication
-and the session management, which will be proposed as OpenXM-RFC 102.
-We note that preliminary experiments show the collective operations
-work well on OpenXM.
-
-%\subsubsection{Competitive distributed computation by various strategies}
-%
-%SINGULAR \cite{Singular} implements {\it MP} interface for distributed
-%computation and a competitive Gr\"obner basis computation is
-%illustrated as an example of distributed computation.
-%Such a distributed computation is also possible on OpenXM as follows:
-%
-%The client creates two servers and it requests
-%Gr\"obner basis comutations from the homogenized input and the input itself
-%to the servers.
-%The client watches the streams by {\tt ox\_select()}
-%and the result which is returned first is taken. Then the remaining
-%server is reset.
-%
-%\begin{verbatim}
-%/* G:set of polys; V:list of variables */
-%/* O:type of order; P0,P1: id's of servers */
-%def dgr(G,V,O,P0,P1)
-%{
-% P = [P0,P1]; /* server list */
-% map(ox_reset,P); /* reset servers */
-% /* P0 executes non-homogenized computation */
-% ox_cmo_rpc(P0,"dp_gr_main",G,V,0,1,O);
-% /* P1 executes homogenized computation */
-% ox_cmo_rpc(P1,"dp_gr_main",G,V,1,1,O);
-% map(ox_push_cmd,P,262); /* 262 = OX_popCMO */
-% F = ox_select(P); /* wait for data */
-% /* F[0] is a server's id which is ready */
-% R = ox_get(F[0]);
-% if ( F[0] == P0 ) {
-% Win = "nonhomo"; Lose = P1;
-% } else {
-% Win = "homo"; Lose = P0;
-% }
-% ox_reset(Lose); /* reset the loser */
-% return [Win,R];
-%}
-%\end{verbatim}
-
 \subsubsection{Nesting of client-server communication}
 
 Under OpenXM-RFC 100, an OpenXM server can be a client of other
 servers. Figure \ref{tree} illustrates a tree-like structure of
 OpenXM client-server communication.
-
 \begin{figure}
 \begin{center}
-\begin{picture}(200,140)(0,0)
-\put(70,120){\framebox(40,15){client}}
-\put(20,60){\framebox(40,15){server}}
-\put(70,60){\framebox(40,15){server}}
-\put(120,60){\framebox(40,15){server}}
+\begin{picture}(200,85)(0,0)
+\put(70,70){\framebox(40,15){client}}
+\put(20,30){\framebox(40,15){server}}
+\put(70,30){\framebox(40,15){server}}
+\put(120,30){\framebox(40,15){server}}
 \put(0,0){\framebox(40,15){server}}
 \put(50,0){\framebox(40,15){server}}
-\put(135,0){\framebox(40,15){server}}
+\put(150,0){\framebox(40,15){server}}
 
-\put(90,120){\vector(-1,-1){43}}
-\put(90,120){\vector(0,-1){43}}
-\put(90,120){\vector(1,-1){43}}
-\put(40,60){\vector(-1,-2){22}}
-\put(40,60){\vector(1,-2){22}}
-\put(140,60){\vector(1,-3){14}}
+\put(90,70){\vector(-2,-1){43}}
+\put(90,70){\vector(0,-1){21}}
+\put(90,70){\vector(2,-1){43}}
+\put(40,30){\vector(-2,-1){22}}
+\put(40,30){\vector(2,-1){22}}
+\put(140,30){\vector(2,-1){22}}
 \end{picture}
 \caption{Tree-like structure of client-server communication}
 \label{tree}
 \end{center}
 \end{figure}
-
 Such a computational model is useful for parallel implementation
 of algorithms whose tasks can be divided into subtasks recursively.
 
@@ -242,3 +146,97 @@ itself.
 %
 %
 %
+
+\subsubsection{Product of univariate polynomials}
+
+Shoup \cite{Shoup} showed that the product of univariate polynomials
+with large degrees and large coefficients can be computed efficiently
+by FFTs over small finite fields and the Chinese remainder theorem.
+It can easily be parallelized as follows:
+
+\begin{tabbing}
+Input: \= $f_1, f_2 \in {\bf Z}[x]$ such that $\deg(f_1), \deg(f_2) < 2^M$\\
+Output: $f = f_1f_2$\\
+$P \leftarrow$ \= $\{m_1,\cdots,m_N\}$ where $m_i$ is an odd prime,\\
+\> $2^{M+1}|m_i-1$, and $m=\prod m_i$ is sufficiently large.\\
+Separate $P$ into disjoint subsets $P_1, \cdots, P_L$.\\
+for \= $j=1$ to $L$: $M_j \leftarrow \prod_{m_i\in P_j} m_i$\\
+Compute $F_j$ such that $F_j \equiv f_1f_2 \bmod M_j$\\
+\> and $F_j \equiv 0 \bmod m/M_j$ in parallel.\\
+\> (The product is computed by FFT.)\\
+return $\phi_m(\sum_j F_j)$\\
+(For $a \in {\bf Z}$, $\phi_m(a) \in (-m/2,m/2)$ and $\phi_m(a)\equiv a \bmod m$.)
+\end{tabbing}
+
+A small sequential sketch of the CRT reconstruction step is given
+at the end of this section.
+
+Figure \ref{speedup}
+shows the speedup factor of the above distributed computation
+on Risa/Asir. For each $n$, two polynomials of degree $n$
+with 3000-bit coefficients are generated and their product is computed.
+The machine is a FUJITSU AP3000,
+a cluster of Sun workstations connected by a high-speed network,
+and MPI over this network is used to implement OpenXM.
+\begin{figure}[htbp]
+\epsfxsize=8.5cm
+\epsffile{speedup.ps}
+\caption{Speedup factor}
+\label{speedup}
+\end{figure}
+
+If the number of servers is $L$ and the inputs are fixed, then the cost to
+compute the $F_j$ in parallel is $O(1/L)$, whereas the cost
+to send and receive polynomials is $O(L)$ if {\tt ox\_push\_cmo()} and
+{\tt ox\_pop\_cmo()} are repeatedly applied on the client.
+Therefore the speedup is limited, and the upper bound of
+the speedup factor depends on the ratio of
+the computational cost to the communication cost for each unit operation.
+Figure \ref{speedup} shows that
+the speedup is satisfactory if the degree is large and $L$
+is not large, say, up to 10 in the above environment.
+If OpenXM provides collective operations for broadcast and reduction,
+such as {\tt MPI\_Bcast} and {\tt MPI\_Reduce} respectively, the cost of
+sending $f_1$, $f_2$ and gathering the $F_j$ may be reduced to $O(\log_2 L)$,
+and we can expect better results in such a case (see the toy cost model
+at the end of this section). In order to implement
+such operations we need new specifications for inter-server communication
+and session management, which will be proposed as OpenXM-RFC 102.
+We note that preliminary experiments show that the collective operations
+work well on OpenXM.
+
+%\subsubsection{Competitive distributed computation by various strategies}
+%
+%SINGULAR \cite{Singular} implements the {\it MP} interface for distributed
+%computation, and a competitive Gr\"obner basis computation is
+%illustrated as an example of distributed computation.
+%Such a distributed computation is also possible on OpenXM as follows:
+%
+%The client creates two servers and requests
+%Gr\"obner basis computations for the homogenized input and for the input
+%itself from the servers.
+%The client watches the streams with {\tt ox\_select()},
+%and the result which is returned first is taken. Then the remaining
+%server is reset.
+%
+%\begin{verbatim}
+%/* G:set of polys; V:list of variables */
+%/* O:type of order; P0,P1: id's of servers */
+%def dgr(G,V,O,P0,P1)
+%{
+%  P = [P0,P1];     /* server list */
+%  map(ox_reset,P); /* reset servers */
+%  /* P0 executes the non-homogenized computation */
+%  ox_cmo_rpc(P0,"dp_gr_main",G,V,0,1,O);
+%  /* P1 executes the homogenized computation */
+%  ox_cmo_rpc(P1,"dp_gr_main",G,V,1,1,O);
+%  map(ox_push_cmd,P,262); /* 262 = OX_popCMO */
+%  F = ox_select(P);       /* wait for data */
+%  /* F[0] is the id of a server which is ready */
+%  R = ox_get(F[0]);
+%  if ( F[0] == P0 ) {
+%    Win = "nonhomo"; Lose = P1;
+%  } else {
+%    Win = "homo"; Lose = P0;
+%  }
+%  ox_reset(Lose); /* reset the loser */
+%  return [Win,R];
+%}
+%\end{verbatim}
+
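+As an illustration of the CRT-based product above, the following is
+a minimal sequential sketch in Python. It is an assumption-level
+example, not part of Risa/Asir or OpenXM: it uses schoolbook modular
+multiplication where the servers use FFT, it computes the modular
+images one after another where OpenXM computes them in parallel, and
+the function names are ours.
+
+\begin{verbatim}
+from math import prod
+
+def polmul_mod(f1, f2, m):
+    # product of coefficient lists f1, f2 over Z/mZ
+    # (schoolbook; the servers would use FFT here)
+    res = [0] * (len(f1) + len(f2) - 1)
+    for i, a in enumerate(f1):
+        for j, b in enumerate(f2):
+            res[i + j] = (res[i + j] + a * b) % m
+    return res
+
+def crt_product(f1, f2, moduli):
+    # recover f1*f2 over Z from its images modulo the pairwise
+    # coprime moduli M_1, ..., M_L (the parallelizable part)
+    m = prod(moduli)
+    images = [polmul_mod(f1, f2, Mj) for Mj in moduli]
+    f = []
+    for cs in zip(*images):
+        # c_j*(m/M_j)*((m/M_j)^{-1} mod M_j) is congruent to
+        # c_j mod M_j and to 0 mod m/M_j, i.e. it is F_j mod m
+        c = sum(cj * (m // Mj) * pow(m // Mj, -1, Mj)
+                for Mj, cj in zip(moduli, cs)) % m
+        f.append(c if c < m // 2 else c - m)  # phi_m lift
+    return f
+
+# (1 + 2x)(3 + 4x) = 3 + 10x + 8x^2
+print(crt_product([1, 2], [3, 4], [101, 103]))  # [3, 10, 8]
+\end{verbatim}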
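+
+The $O(L)$ versus $O(\log_2 L)$ claim for collective operations can
+be checked with a toy round-count model (again a hypothetical Python
+sketch, not OpenXM code): repeatedly pushing to each server costs one
+round per server, while in a tree broadcast every process already
+holding the data forwards it to one process that does not, so the
+number of holders doubles each round.
+
+\begin{verbatim}
+def sequential_rounds(L):
+    # the client pushes the inputs to the L servers one by one
+    return L
+
+def tree_rounds(L):
+    # holders double each round until the client and all
+    # L servers have the data: ceil(log2(L + 1)) rounds
+    rounds, holders = 0, 1
+    while holders < L + 1:
+        holders *= 2
+        rounds += 1
+    return rounds
+
+for L in (4, 10, 100):
+    print(L, sequential_rounds(L), tree_rounds(L))
+# prints: 4 4 3 / 10 10 4 / 100 100 7
+\end{verbatim}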