Sunday, November 29, 2015

OCR with Microsoft Office Document Imaging

This article shows how to integrate the Office 2007 OCR engine, Microsoft Office Document Imaging (MODI), into a C# project. The engine performs OCR on a scanned image and converts it to text.

You need to have the Microsoft Office Document Imaging 12.0 Type Library installed. The MS Office setup doesn't install this component by default, so you have to add it afterwards. To do this:
  • Run the Office 2007 setup program
  • Click the Add or Remove Features button
  • Make sure that the Microsoft Office Document Imaging component (listed under Office Tools) is selected for installation

Add a reference to the Microsoft Office Document Imaging DLL in your project by right-clicking the project in Solution Explorer and choosing Add Reference. On the COM tab, select Microsoft Office Document Imaging 12.0 Type Library.

Code:

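            // needs "using System.IO;" and the MODI COM reference added above
            MODI.Document md = new MODI.Document();
            md.Create("FileName.TIF");
            md.OCR(MODI.MiLANGUAGES.miLANG_ENGLISH, true, true);
            MODI.Image image = (MODI.Image)md.Images[0];

            MODI.Layout layout = image.Layout;

            // create a text file with the same name as the image file
            // (FileMode.CreateNew throws an IOException if the file already exists)
            using (FileStream createFile = new FileStream("FileName.txt", FileMode.CreateNew))
            // save the recognized text in the text file; "using" closes both streams
            using (StreamWriter writeFile = new StreamWriter(createFile))
            {
                writeFile.Write(layout.Text);
            }

            // once you have finished with the layout, release the image file
            // by calling md.Close(false)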
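
The comment in the code above asks for a text file with the same name as the image file, while the snippet itself hard-codes both names. A minimal sketch of deriving the text file name from the image path could look like this. The class OcrPaths and the helper GetTextFilePath are hypothetical names used only for illustration and are not part of MODI; Path.ChangeExtension is the standard System.IO method.

```csharp
using System;
using System.IO;

static class OcrPaths
{
    // Hypothetical helper (not part of MODI): returns the image path
    // with its extension replaced by ".txt".
    public static string GetTextFilePath(string imagePath)
    {
        return Path.ChangeExtension(imagePath, ".txt");
    }

    static void Main()
    {
        // prints "FileName.txt"
        Console.WriteLine(GetTextFilePath("FileName.TIF"));
    }
}
```

The returned path can then be passed to the FileStream constructor in place of the hard-coded file name.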


If you want to read the text word by word, use the following:

            for (int i = 0; i < layout.Words.Count; i++)
            {
                MODI.Word word = (MODI.Word)layout.Words[i];
                string strText = word.Text;
                // use strText here, e.g. append it to a list or write it to a file
            }
