\section{Results}
As input, our system takes a source video and a single target image of a face. It generates a dynamic texture of the target image for a 3D model and retargets the source mesh onto this single input image, inferring details such as wrinkles and the inner mouth region. Representative results are shown in Fig.~\ref{fig:result}; additional results appear in the supplementary material.
\paragraph{Reenactment: Comparison To Previous Works}
When given only a single image as the target input, \cite{f2f} can generate only a static texture and captures none of the wrinkle or inner-mouth details that our system produces, as seen in Fig.~\ref{fig:result}.
Fig.~\ref{fig:wrinkles} shows an example of methods such as \cite{f2f} that do not use our inference model and therefore produce less detailed results when only a single target image is available.
\begin{figure}[th]
\centering
\includegraphics[width=1.0\linewidth]{figures/wrinkles/examples.pdf}
\caption{Wrinkle animations from a single neutral target input image. In each row, the pair of images on the left shows the facial animation achieved with a static texture generated by a multilinear fitting method such as~\cite{f2f}. The pair of images on the right shows the facial animation achieved with the dynamic texture generated by our inference model, which produces detailed wrinkles and fills the inner mouth cavity.}\label{fig:wrinkles}
\vspace{-0.2in}
\end{figure}
\paragraph{Quantitative Evaluation}
To further validate our approach, we performed a quantitative evaluation against an alternative, more direct approach to dynamic texture synthesis. Given an image of the source subject making a facial expression and a neutral frame of the same subject, we apply the per-pixel difference between these two images to the neutral frame of the target subject. We performed this test on 20 expression sequences, each 900 frames long, in which each of the 5 test subjects is retargeted onto each of the others. The test sequences were synchronized in the same manner as the training data, providing ground-truth expressions for each retargeted sequence. The results, shown in Table~\ref{table:paris}, demonstrate that our method synthesizes textures that are closer to the ground-truth texture data. Furthermore, while this more naive approach to texture synthesis (which we call ``direct delta transfer'') is simpler to implement and can synthesize wrinkles and other details on the target subject, simply copying these differences onto the target does not account for geometric dissimilarity between the subjects' faces, and it therefore typically produces results that are far more uncanny and implausible than ours.
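As a rough illustration, a minimal sketch of the direct delta transfer baseline is given below. The function name, the $[0,255]$ value range, and the assumption that all three images are already aligned in a common texture parameterization are ours, not details specified in the paper.
\begin{verbatim}
import numpy as np

def direct_delta_transfer(source_expr, source_neutral, target_neutral):
    # Add the source subject's expression delta (expression frame minus
    # neutral frame) onto the target subject's neutral frame.
    # Assumption: all three images live in the same texture space and
    # store values in [0, 255].
    delta = source_expr.astype(np.float32) - source_neutral.astype(np.float32)
    out = target_neutral.astype(np.float32) + delta
    return np.clip(out, 0.0, 255.0).astype(np.uint8)
\end{verbatim}
Because the delta is copied verbatim, any geometric mismatch between the two faces is transferred as well, which is the source of the uncanny artifacts noted above.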
\begin{table}[h!]
\begin{center}
\resizebox{.45\textwidth}{!}{%
\begin{tabular}{ l c c c}
\hline
Method & Mean L1 Loss $\downarrow$ & Mean L2 Loss $\downarrow$ & SSIM $\uparrow$ \\ \hline
\emph{Our result} & 1360 & 152 & 0.8730 \\ \hline
\emph{Direct delta transfer} & 1790 & 211 & 0.8150 \\ \hline
\end{tabular}}
\end{center}
\caption{Quantitative evaluation against a more direct texture synthesis approach.}
\vspace{-0.2in}
\label{table:paris}
\end{table}
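For reference, the per-frame error metrics in Table~\ref{table:paris} can be computed along the following lines. The exact normalization (per-pixel mean versus per-image sum) and the SSIM implementation used for the table are not specified in this section, so this sketch assumes per-frame means and a recent version of scikit-image.
\begin{verbatim}
import numpy as np
from skimage.metrics import structural_similarity

def frame_metrics(pred, gt):
    # L1 / L2 errors and SSIM between a synthesized texture and the
    # ground-truth texture for one frame.  Per-frame means are an
    # assumption; the normalization used for the table is not stated here.
    diff = pred.astype(np.float32) - gt.astype(np.float32)
    l1 = np.abs(diff).mean()
    l2 = np.sqrt((diff ** 2).mean())
    s = structural_similarity(pred, gt, channel_axis=-1, data_range=255)
    return l1, l2, s
\end{verbatim}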
\input{result_fig}