Preprocessing poorly scanned handwritten digits

16

4

I have a few thousands of PDF files containing B&W images (1bit) from digitalized paper forms. I'm trying to OCR some fields, but sometime the writing is too faint:

enter image description here

I've just learned about morphological transforms. They are really cool!!! I feel like I'm abusing them (like I did with regular expressions when I learned Perl).

I'm only interested in the date, 07-06-2017:

im = cv2.blur(im, (5, 5))
plt.imshow(im, 'gray')

enter image description here

ret, thresh = cv2.threshold(im, 250, 255, 0)
plt.imshow(~thresh, 'gray')

enter image description here

People filling this form seems to have some disregard for the grid, so I tried to get rid of it. I'm able to isolate the horizontal line with this transform:

horizontal = cv2.morphologyEx(
    ~thresh, 
    cv2.MORPH_OPEN, 
    cv2.getStructuringElement(cv2.MORPH_RECT, (100, 1)),
)
plt.imshow(horizontal, 'gray')

enter image description here

I can get the vertical lines as well:

plt.imshow(horizontal ^ ~thresh, 'gray')

ret, thresh2 = cv2.threshold(roi, 127, 255, 0)
vertical = cv2.morphologyEx(
    ~thresh2, 
    cv2.MORPH_OPEN, 
    cv2.getStructuringElement(cv2.MORPH_RECT, (2, 15)), 
    iterations=2
)
vertical = cv2.morphologyEx(
    ~vertical, 
    cv2.MORPH_ERODE, 
    cv2.getStructuringElement(cv2.MORPH_RECT, (9, 9))
)
horizontal = cv2.morphologyEx(
    ~horizontal, 
    cv2.MORPH_ERODE, 
    cv2.getStructuringElement(cv2.MORPH_RECT, (7, 7))
)
plt.imshow(vertical & horizontal, 'gray')

enter image description here

Now I can get rid of the grid:

plt.imshow(horizontal & vertical & ~thresh, 'gray')

enter image description here

The best I got was this, but the 4 is still split into 2 pieces:

plt.imshow(cv2.morphologyEx(im2, cv2.MORPH_CLOSE, 
    cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))), 'gray')

enter image description here

Probably at this point it is better to use cv2.findContours and some heuristic in order to locate each digit, but I was wondering:

  1. should I give up and demand all documents to be rescanned in grayscale?
  2. are there better methods in order to isolate and locate the faint digits?
  3. do you know any morphological transform to join cases like the "4"?

[update]

Is rescanning the documents too demanding? If it is no great trouble I believe it is better to get higher quality inputs than training and trying to refine your model to withstand noisy and atypical data

A bit of context: I'm a nobody working at a public agency in Brazil. The price for ICR solutions start in the 6 digits so nobody believes a single guy can write an ICR solution in-house. I'm naive enough to believe I can prove them wrong. Those PDF documents were sitting at an FTP server (about 100K files) and were scanned just to get rid of the dead tree version. Probably I can get the original form and scan again myself but I would have to ask for some official support - since this is the public sector I would like to keep this project underground as much as I can. What I have now is an error rate of 50%, but if this approach is a dead end there is no point trying to improve it.

Paulo Scardine

Posted 2017-07-10T21:04:08.090

Reputation: 38 629

Answers

8

Maybe with Active contour model of some sort? For example, I found this library: https://github.com/pmneila/morphsnakes

Took your final "4" number:

enter image description here

After some quick tweaking (without actually understanding the parameters, so it may be possible to get a better result) I got this:

enter image description here

with the following code (I also hacked into the morphsnakes.py a little to save the images):

import morphsnakes

import numpy as np
from scipy.misc import imread
from matplotlib import pyplot as ppl

def circle_levelset(shape, center, sqradius, scalerow=1.0):
    """Build a binary function with a circle as the 0.5-levelset."""
    grid = np.mgrid[list(map(slice, shape))].T - center
    phi = sqradius - np.sqrt(np.sum((grid.T)**2, 0))
    u = np.float_(phi > 0)
    return u

#img = imread("testimages/mama07ORI.bmp")[...,0]/255.0
img = imread("four.png")[...,0]/255.0

# g(I)
gI = morphsnakes.gborders(img, alpha=900, sigma=3.5)

# Morphological GAC. Initialization of the level-set.
mgac = morphsnakes.MorphGAC(gI, smoothing=1, threshold=0.29, balloon=-1)
mgac.levelset = circle_levelset(img.shape, (39, 39), 39)

# Visual evolution.
ppl.figure()
morphsnakes.evolve_visual(mgac, num_iters=50, background=img)

Headcrab

Posted 2017-07-10T21:04:08.090

Reputation: 2 907