
Inpainting of Occluded Regions in Handwritings∗

Fabian Hollaus and Robert Sablatnig

Institute of Computer Aided Automation, Computer Vision Lab

Vienna University of Technology
{holl,sab}@caa.tuwien.ac.at

Abstract

This paper deals with the reconstruction of handwritings that are partially overlapped by other texts. The presented system utilizes a high-order Markov Random Field in order to learn image models that capture the statistics of handwritten strokes. Different handwriting models are used for the retouching of missing stroke regions in English words and ancient Greek handwritings. The Greek writings are overwritten by younger texts. The system makes use of multi-spectral images in order to separate the overwritings from the underwritings and to restore the older texts.

1 Introduction

The analysis of historic and degraded documents can be facilitated by multi-spectral imaging [12]. Multi-spectral imaging has also proven its usefulness for the examination of palimpsests¹ [12], [6]. Palimpsests are historic manuscript pages that were reused. The original texts in such palimpsests were scraped off and overwritten by younger writings.

The palimpsest folios utilized in this work are provided by the Archimedes Palimpsest Project [6] and contain erased underwritings, which are almost invisible under red light. Easton et al. [6] discovered that the ancient writings are most visible under ultraviolet (UV) illumination and proposed a pseudo-color approach that visualizes the over- and underwritings. Figure 1 illustrates how such a pseudo-color image is obtained from a red light image, shown in Figure 1 (a), and a UV image, shown in Figure 1 (b). The resulting pseudo-color image is provided in Figure 1 (c).

The readability of those texts is limited, since the ancient texts are partially occluded by younger ones. The aim of this work is to detect the occluded regions and to restore them automatically. The system consists of two stages: Firstly, the overwritings are detected by applying a binarization algorithm to red light images. Secondly, the detected regions are retouched by applying an image inpainting algorithm to UV images.

The term image inpainting describes the automated filling of a specific image region. The region is called the inpainting region, or domain, and is provided in the form of a binary mask.

Similar to [2], the presented system utilizes a Markov Random Field (MRF) to train an image model that captures the statistics of handwritings. The trained model is afterwards used for the recovery of occluded characters. The learning and inpainting procedures are based on the Fields of Experts (FoE) framework, which was proposed by Roth and Black in [13].

The paper is structured as follows. The following section explains the basic ideas of other inpainting techniques. Section 3 provides an overview of the FoE framework. Subsequently, the detection of the overwritings is detailed in Section 4. An evaluation of the presented method is given in Section 5. Finally, the last section contains concluding remarks and an outlook.

Figure 1. Multi-spectral imaging reveals the underwritings in leaf 40r. (a) Only the overwriting is recognizable under red light. (b) The underwriting is most visible under ultraviolet illumination. (c) The pseudo-color image exhibits the older and younger text. [6]

∗ This work was supported by the Austrian Science Foundation (FWF) under grant P23133.
¹ Greek: palimpsestos ("scraped again").

2 Related Work

The term digital image inpainting was coined by Bertalmio et al. in [1]. The authors suggest performing the inpainting task by solving Partial Differential Equations (PDEs). The basic idea of the algorithm is to smoothly continue the isophote lines that arrive at the border of the inpainting region from the outside of the mask into the inner region. Another PDE based algorithm is proposed by Chan et al. [3]: They suggest solving the inpainting problem by minimizing an energy function, namely the Total Variation (TV). Recently, a further TV based inpainting algorithm has been proposed by Dahl et al. [5]. Common to the aforesaid algorithms is that anisotropic diffusion is applied during the inpainting process in order to maintain edges. The methods are only able to recover the geometry of an image, whereas texture is not preserved. Therefore, those methods are named geometric inpainting techniques. Algorithms belonging to this group typically work well only on narrow inpainting domains, whereas large regions are blurred out [9].

Recently, other inpainting methods have been proposed that overcome the disadvantages of geometric inpainting techniques. Those algorithms are called textural inpainting methods. A pioneering work is presented by Criminisi et al. [4]. In this work the inpainting task is performed by iteratively copying image patches from the outside of the mask into the inside. Komodakis and Tziritas [9] point out that the algorithm by Criminisi et al. is based on heuristics and propose instead to formulate the inpainting task as a global optimization problem. Therefore, they suggest the use of an MRF, which poses the filling task as a labeling problem. Contrary to geometric methods, textural inpainting techniques do not blur out large and textured regions. Nonetheless, we have chosen a geometric inpainting approach for the recovery of handwritings, because images containing text are far less textured than natural images.

Cao and Govindaraju [2] suggest an approach for the restoration of handwritings. The proposed method utilizes a patch-based, pairwise MRF for the binarization of a handwritten text and for the subsequent retouching of occluded regions. In contrast, the FoE framework utilizes a pixel-based, high-order MRF for the reconstruction of gray-value images.

3 Fields of Experts

An MRF can be formulated as an undirected graphical model. Such a graph G = (V, E) consists of nodes v ∈ V and edges e ∈ E that connect the nodes. With this graph G it is possible to model a d-dimensional random vector x. Each random variable $x_v$ is represented by a node v. The edges describe the relationships between the nodes. A clique c ∈ C is a subset of neighboring nodes v. For each clique c there is a potential function $f_c$ that assigns a positive value to the clique. The joint distribution is defined as:

$$
p(x) = \frac{1}{Z} \prod_{c \in C} f_c(x_{(c)}) \tag{1}
$$

where Z is a normalizing function. Depending on the maximal clique size, two MRF types are distinguished: pairwise MRFs and high-order MRFs. In a pairwise MRF the largest clique size is two. If the largest clique size exceeds two, the graphical model is called a high-order MRF.

Model Definition

Roth and Black propose a high-order MRF in [13]. Their model outperforms simple pairwise graphical models, because it is able to model long-range correlations in images, while pairwise models are only able to model dependencies between directly neighboring pixels. Another advantage over earlier MRFs is that the parameters of the potential functions are learnt from a training database. The potential functions that are used in the FoE framework are called experts and were introduced by Hinton in [8]. The experts are defined as:

$$
f(x_{(k)}) = f_{PoE}(x_{(k)}; \Theta) = \prod_{i=1}^{N} \phi(J_i^T x_{(k)}; \alpha_i), \tag{2}
$$

where $x_{(k)}$ is the k-th patch of an image x. The image patch $x_{(k)}$ is projected onto a linear filter $J_i$. $\alpha_i$ is a scalar value, which is named the expert parameter. There are N filters and N expert parameters. In this work 3 × 3 filters were utilized and N was set to 8, as suggested by Roth and Black for this particular filter size. The expert function φ is a heavy-tailed Student t-distribution. This distribution models the heavy-tailed marginal distributions of natural images more accurately than a Gaussian distribution. The utilized expert function is defined by:

$$
\phi(J_i^T x; \alpha_i) = \left(1 + \frac{1}{2}\,(J_i^T x)^2\right)^{-\alpha_i} \tag{3}
$$

Since the potential functions are now defined, the generic MRF prior in Equation 1 can be reformulated into the FoE prior:

$$
p_{FoE}(x; \Theta) = \frac{1}{Z(\Theta)} \prod_{k=1}^{K} \prod_{i=1}^{N} \phi(J_i^T x_{(k)}; \alpha_i), \tag{4}
$$

where K is the number of nodes (that is, pixels) of an image x, Z(Θ) is a normalizing function for a concrete parameter set Θ, and the other variables are defined as above.
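To make Equation 4 more tangible, the following minimal sketch evaluates the unnormalized log of the FoE prior for a gray-value image. It is an illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, only patches that lie fully inside the image are used (so the number of patches is slightly smaller than K), and the intractable term log Z(Θ) is omitted.

```python
import numpy as np

def extract_patches(image, size):
    """Return all overlapping size x size patches as rows of a 2-D array."""
    windows = np.lib.stride_tricks.sliding_window_view(image, (size, size))
    return windows.reshape(-1, size * size)

def foe_log_prior_unnormalized(image, filters, alphas):
    """Sum of log-experts over all patches and filters (log Z is omitted).

    filters: array of shape (N, size, size), one linear filter J_i per expert.
    alphas:  array of shape (N,), the expert parameters alpha_i.
    """
    n_filters, size, _ = filters.shape
    patches = extract_patches(np.asarray(image, dtype=float), size)
    # Filter responses J_i^T x_(k) for every patch k and filter i.
    responses = patches @ filters.reshape(n_filters, -1).T
    # Log of the Student-t expert from Equation 3.
    log_experts = -alphas * np.log1p(0.5 * responses ** 2)
    return log_experts.sum()
```

With zero-mean filters, a perfectly flat image produces zero filter responses and therefore the maximal value 0, while any image structure lowers the value.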

Learning

The filters and the corresponding expert parameters are learned from a dataset that consists of 20000 image patches. Since the training of an FoE model is highly time-consuming, the model is trained on 15 × 15 patches instead of whole images. This particular patch size was suggested by Roth and Black in order to reduce the computational burden. The training algorithm is based on the Maximum Likelihood approach. Thus, the likelihood is maximized w.r.t. the parameter set Θ. The interested reader is referred to [13] for a detailed description of the training algorithm.
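The Maximum Likelihood objective can be written out as follows. This is a sketch of the standard derivation for energy-based models rather than a restatement of the exact procedure in [13]; the energy E, the number of training patches D, and the training set X = {x^(1), ..., x^(D)} are notation introduced here for illustration, and θ stands for any single parameter in Θ.

$$
E(x; \Theta) = -\sum_{k} \sum_{i=1}^{N} \log \phi(J_i^T x_{(k)}; \alpha_i),
\qquad
p_{FoE}(x; \Theta) = \frac{1}{Z(\Theta)} \exp\bigl(-E(x; \Theta)\bigr)
$$

$$
\frac{\partial}{\partial \theta} \, \frac{1}{D} \sum_{d=1}^{D} \log p_{FoE}(x^{(d)}; \Theta)
= \left\langle \frac{\partial E}{\partial \theta} \right\rangle_{p_{FoE}}
- \left\langle \frac{\partial E}{\partial \theta} \right\rangle_{X}
$$

The second term is an average over the training patches. The first term is an expectation under the model distribution and cannot be computed in closed form because of Z(Θ); in [13] it is approximated by sampling with contrastive divergence, which is one reason why training is so time-consuming.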

Inference

The inference in the FoE framework is based on the Maximum A Posteriori approach. Since exact inference is NP-hard in general [13], an approximate technique is applied instead: In an iterative procedure, a gradient ascent is performed on the log of the posterior. The posterior contains no likelihood term, since it is assumed that no observation is made inside the inpainting region. Thus the posterior depends only on the FoE prior, and the iterative update is given by:

$$
x^{(t+1)} = x^{(t)} + \eta M \left[ \sum_{i=1}^{N} J_i^{-} * \psi'(J_i * x^{(t)}; \alpha_i) \right], \tag{5}
$$

where η is a user-defined step size and M is the binary inpainting mask, which restricts the update to the pixels inside the inpainting region [13]. The filter $J_i^{-}$ is obtained by mirroring $J_i$ around its center, and t is the current iteration of the inpainting process. A relatively large number of iterations is necessary to ensure convergence. In our implementation 2500 iterations are carried out and the step size η is set to 100. The function ψ′ is the derivative of the log of the Student t-distribution defined in Equation 3.
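The update in Equation 5 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation used in this work: the function names are hypothetical, image borders are handled by edge replication (a choice made here), and suitable values for `step` and `iterations` depend on the scaling of the image and the filters, so they are left to the caller. The derivative follows from Equation 3: ψ(y; α) = −α log(1 + y²/2), hence ψ′(y; α) = −α y / (1 + y²/2).

```python
import numpy as np

def convolve_same(image, kernel):
    """2-D convolution with edge-replicating padding; output has the input's shape."""
    kh, kw = kernel.shape
    padded = np.pad(image, ((kh // 2,) * 2, (kw // 2,) * 2), mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (kh, kw))
    # Convolution is correlation with the kernel flipped in both directions.
    return np.einsum("ijkl,kl->ij", windows, kernel[::-1, ::-1])

def psi_prime(y, alpha):
    """Derivative of log phi(y; alpha) = -alpha * log(1 + y^2 / 2)."""
    return -alpha * y / (1.0 + 0.5 * y ** 2)

def foe_inpaint(image, mask, filters, alphas, *, step, iterations):
    """Gradient ascent on the log FoE prior, restricted to the masked pixels.

    mask is nonzero inside the inpainting region and zero elsewhere.
    """
    x = np.asarray(image, dtype=float).copy()
    m = np.asarray(mask, dtype=float)
    for _ in range(iterations):
        grad = np.zeros_like(x)
        for J, alpha in zip(filters, alphas):
            response = convolve_same(x, J)                  # J_i * x
            mirrored = J[::-1, ::-1]                        # J_i^-
            grad += convolve_same(psi_prime(response, alpha), mirrored)
        x += step * m * grad                                # known pixels stay fixed
    return x
```

Because the mask multiplies the gradient, pixels outside the inpainting region are never modified; only the unknown pixels move towards a configuration that the prior considers more likely.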

Applying the FoE Model to Handwritings

The learning procedure for handwritings is similar to the original FoE algorithm, except that a different training database is utilized. A further difference from training on natural images is the selection of the training patches: The patches that are used for the training of a generic prior are randomly extracted from input images. A different sampling strategy is chosen for the generation of the handwriting training data: Each patch that is sampled from a handwriting database must contain a certain number of pixels that belong to a character stroke. Thus, the prior learns the statistics of the characters rather than those of the background. The prior that captures the statistics of ancient writings was trained on 20000 patches, which were extracted from 40 UV images. Sampling in overwritten regions was avoided in order to learn solely the statistics of the underwritings. The inpainting of handwritings is also very similar to the restoration of natural images, except for one simple modification: In the standard implementation the inpainting region is initialized with zero, while in this work the domain is initialized with the dominant background color. It was found that this simple measure reduces the convergence time and helps to avoid smearing artifacts.
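The two modifications described above can be sketched as follows. This is an illustrative sketch, not the code used in this work: the function names, the threshold `min_stroke_pixels` (the text only states "a certain number"), the optional `forbidden_mask` for overwritten regions, and the estimation of the dominant background color as the histogram mode of the known pixels are all assumptions made here.

```python
import numpy as np

def sample_stroke_patches(image, stroke_mask, n_patches, size=15,
                          min_stroke_pixels=20, forbidden_mask=None,
                          rng=None, max_tries=100_000):
    """Randomly sample patches that contain enough character-stroke pixels."""
    rng = np.random.default_rng() if rng is None else rng
    height, width = image.shape
    patches = []
    for _ in range(max_tries):
        if len(patches) == n_patches:
            break
        r = rng.integers(0, height - size + 1)
        c = rng.integers(0, width - size + 1)
        window = (slice(r, r + size), slice(c, c + size))
        # Skip patches that touch regions excluded from training (e.g. overwritings).
        if forbidden_mask is not None and forbidden_mask[window].any():
            continue
        if stroke_mask[window].sum() >= min_stroke_pixels:
            patches.append(image[window].copy())
    return np.array(patches)

def init_with_background(image, inpaint_mask, bins=256):
    """Fill the inpainting region with the most frequent gray value outside it."""
    mask = np.asarray(inpaint_mask, dtype=bool)
    hist, edges = np.histogram(image[~mask], bins=bins)
    peak = int(np.argmax(hist))
    background = 0.5 * (edges[peak] + edges[peak + 1])
    out = np.asarray(image, dtype=float).copy()
    out[mask] = background
    return out
```

The stroke mask itself could come from any binarization of the training images; how it was obtained for the training data is not specified in the text.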

4 Palimpsest Reconstruction

Before a palimpsest text can be retouched, it is necessary to identify the younger text. This text is generally in a better condition than the underwriting, but the contrast between the overwriting and the background varies strongly. In order to extract a reliable inpainting mask, a recently suggested binarization algorithm by Su et al. [14] was implemented. The method is especially designed for the binarization of historical and degraded documents.

Extraction of the Inpainting Mask

The binarization algorithm starts with the construction of a contrast image. Each pixel in this image encodes the difference between the maximum and the minimum intensity value in a local neighborhood. The contrast is especially high at the border of character strokes. In order to suppress the background variation, the absolute difference is normalized and the contrast image is defined by:

$$
D(x, y) = \frac{f_{\max}(x, y) - f_{\min}(x, y)}{f_{\max}(x, y) + f_{\min}(x, y) + \epsilon} \tag{6}
$$

where $f_{\max}(x, y)$ and $f_{\min}(x, y)$ denote the maximum and minimum gray value in a local neighborhood. In this work a window size of 3 × 3 is used. ε > 0 is a very small positive value that avoids a division by zero.
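Equation 6 translates directly into a sliding-window computation. The sketch below is one possible way to compute it, with assumptions labeled: the function name, the edge-replicating border handling, and the concrete value of `eps` are choices made here and are not taken from [14] or from this work.

```python
import numpy as np

def local_contrast(image, window=3, eps=1e-8):
    """Normalized local contrast D(x, y) from Equation 6."""
    img = np.asarray(image, dtype=float)
    padded = np.pad(img, window // 2, mode="edge")
    windows = np.lib.stride_tricks.sliding_window_view(padded, (window, window))
    f_max = windows.max(axis=(2, 3))   # local maximum gray value
    f_min = windows.min(axis=(2, 3))   # local minimum gray value
    return (f_max - f_min) / (f_max + f_min + eps)
```

For a step edge between gray values 200 and 50, the pixels adjacent to the edge receive the contrast 150 / 250 = 0.6, while flat regions receive 0. The normalization by the local sum makes the same relative stroke contrast score equally on bright and on dark parchment.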

The contrast image D is afterwards binarized in order to detect high contrast pixels that are located at stroke boundaries. A pixel is marked as a high contrast pixel if its value D(x, y) exceeds a global threshold. This threshold is determined with Otsu's [11] thresholding approach. At this point, pixels at the stroke boundaries are marked as high contrast pixels. A pixel near a stroke boundary is subsequently classified as a foreground pixel if two requirements are met: Firstly, the pixel must have a certain number of high contrast pixels in its local neighborhood. In our implementation a pixel is marked as a foreground pixel if it has at least 4 high contrast pixels in a neighborhood window with a size of 11 × 11. Secondly, the pixel intensity f(x, y) must be less than or equal to the mean intensity of the high contrast pixels (in the original image) in its neighborhood. The detected inpainting mask is afterwards slightly dilated. This post-processing step is necessary, because otherwise pixels belonging to the border of the overwriting might not be covered by the mask. If such vestiges of the overwriting touch the mask border, it is likely that the corruption is propagated into the inpainting domain.
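The classification steps above can be sketched as follows, taking the contrast image D as input. This is a simplified illustration of the rules as stated in this section, not a faithful reimplementation of [14]: the function names, the histogram-based Otsu variant, the zero padding at the image border, and the dilation by a 3 × 3 square (`dilate=1`; the text only says "slightly dilated") are assumptions.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Otsu's threshold: maximize the between-class variance of the histogram."""
    hist, edges = np.histogram(np.ravel(values), bins=bins)
    centers = 0.5 * (edges[:-1] + edges[1:])
    w0 = np.cumsum(hist).astype(float)          # weight of the lower class
    w1 = w0[-1] - w0                            # weight of the upper class
    s0 = np.cumsum(hist * centers)
    valid = (w0 > 0) & (w1 > 0)
    mean0 = np.divide(s0, w0, out=np.zeros_like(s0), where=valid)
    mean1 = np.divide(s0[-1] - s0, w1, out=np.zeros_like(s0), where=valid)
    between = np.where(valid, w0 * w1 * (mean0 - mean1) ** 2, -1.0)
    return edges[int(np.argmax(between)) + 1]

def window_sum(array, window):
    """Sum over a window x window neighborhood (zero padding at the border)."""
    padded = np.pad(np.asarray(array, dtype=float), window // 2, mode="constant")
    views = np.lib.stride_tricks.sliding_window_view(padded, (window, window))
    return views.sum(axis=(2, 3))

def overwriting_mask(image, contrast, window=11, min_count=4, dilate=1):
    """Classify foreground pixels from a contrast image and dilate the result."""
    img = np.asarray(image, dtype=float)
    high = contrast > otsu_threshold(contrast)           # high contrast pixels
    count = window_sum(high, window)                     # number per neighborhood
    intensity_sum = window_sum(np.where(high, img, 0.0), window)
    # Mean intensity of the high contrast pixels; -inf where there are none.
    mean_high = np.divide(intensity_sum, count,
                          out=np.full_like(img, -np.inf), where=count > 0)
    mask = (count >= min_count) & (img <= mean_high)
    for _ in range(dilate):
        mask = window_sum(mask, 3) > 0                   # 3 x 3 binary dilation
    return mask
```

Applied to a red light image and its contrast image, the returned Boolean array plays the role of the inpainting mask M that is later used on the UV image.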

5 Experiments

In the subsequent experiments the Peak Signal-to-Noise Ratio (PSNR) is used to measure the similarity between inpainted images and corresponding ground truth images. The PSNR is defined by:

$$
PSNR = 20 \log_{10} \left( \frac{255}{\sqrt{MSE}} \right), \tag{7}
$$

where MSE denotes the Mean Square Error between the two images. The PSNR values are given in decibels (dB).
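A minimal sketch of Equation 7 is given below. The optional `mask` argument, which restricts the error to the inpainting region as done for the in-mask values reported later, and the function name are assumptions made for illustration.

```python
import numpy as np

def psnr(restored, ground_truth, mask=None, peak=255.0):
    """PSNR in dB between two gray-value images, optionally inside a mask only."""
    a = np.asarray(restored, dtype=float)
    b = np.asarray(ground_truth, dtype=float)
    if mask is not None:
        m = np.asarray(mask, dtype=bool)
        a, b = a[m], b[m]
    mse = np.mean((a - b) ** 2)
    if mse == 0:
        return float("inf")            # identical images
    return 20.0 * np.log10(peak / np.sqrt(mse))
```

As a sanity check, two images that differ by a constant 25.5 gray levels have an MSE of 25.5², so the PSNR is 20 log10(255 / 25.5) = 20 dB.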

Synthetic Data

The first experiment is conducted on words that consist of Latin letters. The utilized words are taken from the IAM handwriting database [10]. The performance of three different FoE models is evaluated: One prior captures the statistics of natural images. The remaining two priors are trained on different handwritings. The first handwriting model is trained on images that are randomly extracted from the database, which contains mainly cursively written words. The second prior is instead trained on uppercase letters, which have a different main orientation than the characters in the cursive words.

Two test sets have been generated. The first set contains 100 randomly chosen words, whereas the second set comprises 100 words that are written in uppercase. The masks are generated from words that have similar sizes as the input images. Table 1 shows the average of the PSNR values that are obtained inside the inpainting mask. It can be seen that the handwriting models achieve a higher performance if the statistics of the test images are similar to the statistics of the patches that have been used in the learning procedure. Hence, the weakest performance is achieved by the model that is trained on natural images. The last column of Table 1 shows numerical results that are obtained by a publicly available inpainting algorithm [5], which is based on TV. The results indicate that the FoE algorithm is more suitable for handwriting recovery than the TV based method, which is designed for natural images. This can also be seen in Figure 2, where outputs of both handwriting priors are shown, along with a result that is generated by the TV based algorithm.

Test set \ Training set   Mixed letters   Capitals   Natural images   TV based algorithm [5]
Lower case letters        19.03 dB        18.78 dB   18.06 dB         15.05 dB
Upper case letters        17.08 dB        17.37 dB   16.34 dB         15.23 dB

Table 1. Restoration performance of three different image priors and a TV inpainting technique.

Figure 2. Text restoration. (a) Input image. (b) Image restored using a prior that has been trained mainly on cursive words. The retouched image looks more natural than the one in (c), which is obtained with a prior that captures the statistics of uppercase letters. (d) Output of a TV inpainting algorithm.

Palimpsest Reconstruction

Four test panels have been extracted from different parchments of the Archimedes palimpsest. It turned out that the inpainting algorithm is limited to relatively small inpainting domains: The extent of the occluded palimpsest regions (at the original resolution of 700 dots per inch) is too large for a proper restoration, since character strokes are only restored at the border of the mask, whereas the mask centers are blurred out. Therefore, the test panels had to be downsized from 2001 × 2001 pixels to 501 × 501 pixels. A ground truth set was generated manually. The PSNR values of the restored panels are given in Table 2. As in the previous experiment, the similarity inside the computed inpainting region is evaluated. Additionally, the similarity of the entire image is provided, since the mask does not always cover the overwritings completely. The last two columns in the table show the numerical results that are obtained by the TV based algorithm [5]. The PSNR values that are achieved by this algorithm are significantly smaller than the PSNR values that are obtained by the FoE method.

Panel 99 verso is in a better condition than the other panels, which explains the relatively high PSNR values that are achieved inside the inpainting domain and on the entire image. The similarity values inside the mask regions are higher than the PSNR values that are obtained on synthetic data. Nevertheless, it has to be mentioned that the contrast between the foreground and the background in the palimpsest images is considerably smaller than in the synthetic images. Thus, the MSE has a smaller upper bound in the case of the ancient writings.
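The effect of the contrast on the PSNR can be illustrated with a simple worst-case bound. The following is a sketch under the assumption that all restored gray values stay between the background and the stroke intensity; Δ denotes the gray-value difference between foreground and background, and the two values of Δ are hypothetical examples, not measurements from the test data.

$$
MSE \le \Delta^2
\quad \Longrightarrow \quad
PSNR \ge 20 \log_{10} \left( \frac{255}{\Delta} \right)
$$

$$
\Delta = 200: \; PSNR \ge 20 \log_{10}(1.275) \approx 2.1 \text{ dB},
\qquad
\Delta = 50: \; PSNR \ge 20 \log_{10}(5.1) \approx 14.2 \text{ dB}
$$

Even a restoration that is wrong at every pixel therefore scores considerably higher on a low-contrast image, so PSNR values from the palimpsest panels and from the synthetic words are not directly comparable.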

Test set   Image - FoE   Mask region - FoE   Image - TV   Mask region - TV
40 recto   27.29 dB      23.35 dB            23.95 dB     19.66 dB
48 verso   27.83 dB      24.00 dB            25.27 dB     20.70 dB
58 verso   28.88 dB      24.31 dB            25.22 dB     20.42 dB
99 verso   29.94 dB      26.90 dB            26.66 dB     22.51 dB

Table 2. Palimpsest recovery performance.

In Figure 3 a part of the 48 verso panel is presented, along with various interim results. The final inpainting result is shown in Figure 3 (f) and the manually created ground truth image is given in Figure 3 (e). It can be seen that some gaps are successfully restored, while others are not altered adequately. A few parts of the overwriting are not covered by the mask. Those remaining parts lead to the introduction of nonexistent structures.

One example of the introduction of artificial structures is illustrated in Figure 4. The overwriting is not fully detected, which can be seen by comparing the inpainting problem, given in Figure 4 (a), with the ground truth image, shown in Figure 4 (b). The FoE prior propagates the surrounding dark regions into the unknown image regions, as can be seen in Figure 4 (c). The unknown regions are filled more heterogeneously, compared to the TV based inpainting result, which is shown in Figure 4 (d). This explains the relatively weak performance (in terms of PSNR) of the TV based algorithm.

Figure 3. Portion of folio 48 verso. (a) Photograph taken under red light. (b) The mask generated from the former image. (c) UV image and superimposed inpainting mask. The mask is slightly dilated to prevent a propagation of remaining overwritings. (d) UV image. (e) Ground truth image. (f) Inpainting result.

Figure 4. Portion of folio 99 verso, illustrating the sensitivity to neighboring dark regions. (a) Input image and superimposed mask, which does not entirely cover the overwritten text. (b) Ground truth image. (c) Inpainted result produced by an FoE prior. (d) Result of the TV based algorithm [5].

6 Conclusion

This paper introduced an approach for the automatic inpainting of occluded handwriting regions. The FoE framework has been utilized in order to learn image models that capture the statistics of handwritten text. It was shown that such an image model learns the main orientation of character strokes, which is a major characteristic of handwritten text.

The statistical inpainting method was also used for the reconstruction of underwritings in palimpsests. It turned out that the presented system has several weaknesses: Firstly, noise is often reinforced during the inpainting task, which leads to the introduction of nonexistent structures. Secondly, the inpainting approach is incapable of retouching the utilized palimpsest at the original resolution, since strokes belonging together are not connected. Therefore the images had to be downsized, which is inappropriate for a subsequent text analysis. Due to the described limitations, the developed technique is currently inapplicable for the automated retouching of palimpsest writings.

It has to be mentioned that the performance may be increased by using 5 × 5 filters, as noted in [13]. Furthermore, Heess et al. [7] have shown that the inpainting quality can be improved by using bimodal potential functions. These issues will be studied in our future work. It will also be investigated whether a textural inpainting method is capable of overcoming the aforementioned drawbacks.

References

[1] M. Bertalmio, G. Sapiro, V. Caselles, and C. Ballester. Image inpainting. In SIGGRAPH 2000, pages 417–424, 2000.

[2] H. Cao and V. Govindaraju. Preprocessing of Low-Quality Handwritten Documents Using Markov Random Fields. PAMI, 31(7):1184–1194, 2009.

[3] T.F. Chan and J.H. Shen. Mathematical models for local deterministic inpaintings. CAM report 00-11, UCLA, 2000.

[4] A. Criminisi, P. Perez, and K. Toyama. Region Filling and Object Removal by Exemplar-Based Image Inpainting. IEEE Transactions on Image Processing, 13(9):1200–1212, 2004.

[5] J. Dahl, P.C. Hansen, S.H. Jensen, and T.L. Jensen. Algorithms and software for total variation image reconstruction via first-order methods. Numerical Algorithms, 53(1):67–92, 2010.

[6] R.L. Easton, K.T. Knox, and W.A. Christens-Barry. Multispectral imaging of the Archimedes palimpsest. In 32nd Applied Imagery Pattern Recognition Workshop, pages 111–116, 2003.

[7] N. Heess, C.K.I. Williams, and G.E. Hinton. Learning generative texture models with extended fields-of-experts. In British Machine Vision Conference, 2009.

[8] G. Hinton. Products of experts. In International Conference on Artificial Neural Networks, volume 1, pages 1–6, 1999.

[9] N. Komodakis and G. Tziritas. Image Completion Using Efficient Belief Propagation Via Priority Scheduling and Dynamic Pruning. IEEE Transactions on Image Processing, 16(11):2649–2661, 2007.

[10] U.-V. Marti and H. Bunke. The IAM-database: an English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 5:39–46, 2002.

[11] N. Otsu. A threshold selection method from gray-level histograms. IEEE Transactions on Systems, Man and Cybernetics, 9(1):62–66, 1979.

[12] K. Rapantzikos and C. Balas. Hyperspectral imaging: potential in non-destructive analysis of palimpsests. In ICIP, volume 2, pages 618–621, 2005.

[13] S. Roth and M. J. Black. Fields of Experts: A Framework for Learning Image Priors. In CVPR, pages 860–867, 2005.

[14] B. Su, S. Lu, and C. L. Tan. Binarization of historical document images using the local maximum and minimum. In Proceedings of the 9th IAPR International Workshop on Document Analysis Systems, pages 159–166, 2010.