python - Split text lines in scanned document


I am trying to find a way to split the lines of text in a scanned document that has been adaptively thresholded. Right now, I store the pixel values of the document as unsigned ints from 0 to 255, take the average of the pixels in each row, and split the rows into ranges based on whether the average pixel value is larger than 250. I then take the median of each range of rows for which this holds. However, this method sometimes fails, because there can be black splotches on the image.

Is there a more noise-resistant way to do this task?

Edit: Here is some code. "warped" is the name of the original image, and "cuts" is where I want to split the image.

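    import numpy as np
    # threshold_adaptive is presumably the scikit-image function (older releases)

    warped = threshold_adaptive(warped, 250, offset = 10)
    warped = warped.astype("uint8") * 255

    # get areas where we can split the image on whitespace to make OCR more accurate
    color_level = np.array([np.sum(line) / len(line) for line in warped])
    cuts = []
    i = 0
    while i < len(color_level):
        if color_level[i] > 250:
            begin = i
            # bounds check so a whitespace run at the bottom edge does not overrun the array
            while i < len(color_level) and color_level[i] > 250:
                i += 1
            cuts.append((i + begin) // 2)  # middle of the whitespace region
        else:
            i += 1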

Edit 2: Sample image added.

enter image description here

From your input image, you need to make the text white and the background black:

enter image description here
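If you stay in Python, the polarity flip could look roughly like this in NumPy. This is only a sketch: `binarize_text_white` is a made-up helper name, and the cutoff of 200 is simply the value used in the C++ code in this answer, so it will probably need tuning for other scans.

```python
import numpy as np

def binarize_text_white(gray, cutoff=200):
    """Return a uint8 mask: 255 where a pixel is darker than `cutoff` (ink), 0 elsewhere.

    `gray` is assumed to be a 2-D grayscale image with dark text on a light page.
    """
    gray = np.asarray(gray)
    return (gray < cutoff).astype(np.uint8) * 255

# Tiny hand-made "page": mostly white paper with two dark pixels.
page = np.array([[255, 255, 255],
                 [255,  30, 255],
                 [250, 199, 200]], dtype=np.uint8)
mask = binarize_text_white(page)
```

With this polarity, rows that contain only background sum to zero, which is what makes the projection step easy to threshold.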

Then you need to compute the rotation angle of the bill. A simple approach is to find the minAreaRect of all white points (findNonZero), and you get:

enter image description here

Then you can rotate the bill, so that the text is horizontal:

enter image description here

Now you can compute the horizontal projection (reduce). You can take the average value in each row. Apply a threshold th on the histogram to account for some noise in the image (here I used 0, i.e. no noise). Rows with only background will have a value >0, and text rows will have value 0 in the histogram. Then take the average bin coordinate of each continuous sequence of white bins in the histogram. That will be the y coordinate of your lines:

enter image description here
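Since the question is in Python, here is a rough NumPy sketch of just this projection step. It assumes the image is already deskewed and binarized with text > 0 and background 0. `find_line_cuts` and its parameter names are made up for illustration. Unlike the C++ loop, it also reports a blank band that runs to the bottom edge.

```python
import numpy as np

def find_line_cuts(binary, th=0.0):
    """Return the middle row index of every blank band in `binary`.

    binary: 2-D array, text pixels > 0, background pixels == 0.
    th:     rows whose mean value is <= th are treated as blank, so a
            small positive th tolerates a few stray noise pixels.
    """
    row_mean = np.asarray(binary, dtype=np.float64).mean(axis=1)
    is_space = row_mean <= th

    cuts = []
    start = None  # first row of the blank band currently being scanned
    for y, blank in enumerate(is_space):
        if blank and start is None:
            start = y
        elif not blank and start is not None:
            cuts.append((start + y - 1) // 2)  # band covers rows start..y-1
            start = None
    if start is not None:  # blank band that reaches the bottom edge
        cuts.append((start + len(is_space) - 1) // 2)
    return cuts

# Synthetic page: two text lines (rows 2-3 and 7-8) and one noise speck at row 5.
img = np.zeros((12, 10), dtype=np.uint8)
img[2:4, 2:8] = 255
img[7:9, 2:8] = 255
img[5, 0] = 255

cuts_strict = find_line_cuts(img, th=0.0)     # the speck splits the middle gap in two
cuts_tolerant = find_line_cuts(img, th=30.0)  # the speck is ignored
```

In this toy case the noisy row has a mean of 25.5, so any th between that and the mean of a real text row (153 here) keeps the gap in one piece. On real scans a sensible th depends on image width and how dense the splotches are.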

Here is the code. It's in C++, but since most of the work is done by OpenCV functions, it should be easily convertible to Python. At least, you can use it as a reference:

    #include <opencv2/opencv.hpp>
    using namespace cv;
    using namespace std;

    int main()
    {
        // Read image
        Mat3b img = imread("path_to_image");

        // Binarize image. Text is white, background is black
        Mat1b bin;
        cvtColor(img, bin, COLOR_BGR2GRAY);
        bin = bin < 200;

        // Find all white pixels
        vector<Point> pts;
        findNonZero(bin, pts);

        // Get rotated rect of white pixels
        RotatedRect box = minAreaRect(pts);
        if (box.size.width > box.size.height)
        {
            swap(box.size.width, box.size.height);
            box.angle += 90.f;
        }

        Point2f vertices[4];
        box.points(vertices);

        for (int i = 0; i < 4; ++i)
        {
            line(img, vertices[i], vertices[(i + 1) % 4], Scalar(0, 255, 0));
        }

        // Rotate the image according to the found angle
        Mat1b rotated;
        Mat M = getRotationMatrix2D(box.center, box.angle, 1.0);
        warpAffine(bin, rotated, M, bin.size());

        // Compute horizontal projections
        // (CV_REDUCE_AVG is the OpenCV 2.x name; OpenCV 3+ spells it REDUCE_AVG)
        Mat1f horProj;
        reduce(rotated, horProj, 1, CV_REDUCE_AVG);

        // Remove noise in histogram. White bins identify space lines, black bins identify text lines
        float th = 0;
        Mat1b hist = horProj <= th;

        // Get mean coordinate of white pixel groups
        vector<int> ycoords;
        int y = 0;
        int count = 0;
        bool isSpace = false;
        for (int i = 0; i < rotated.rows; ++i)
        {
            if (!isSpace)
            {
                if (hist(i))
                {
                    isSpace = true;
                    count = 1;
                    y = i;
                }
            }
            else
            {
                if (!hist(i))
                {
                    isSpace = false;
                    ycoords.push_back(y / count);
                }
                else
                {
                    y += i;
                    count++;
                }
            }
        }

        // Draw lines as final result
        Mat3b result;
        cvtColor(rotated, result, COLOR_GRAY2BGR);
        for (int i = 0; i < ycoords.size(); ++i)
        {
            line(result, Point(0, ycoords[i]), Point(result.cols, ycoords[i]), Scalar(0, 255, 0));
        }

        return 0;
    }
