[Linux] converting PDF to DOC?

Larry Kagan linux@flux.org
Fri, 22 Jun 2007 15:13:44 -0400



Robert Citek wrote:
> On 06/22/2007 10:32 AM, Adam Glass wrote:
>   
>> You could print the PDF and then scan the pages and then run the OCR 
>> software.
>>     
>
> Would it be possible to directly convert the PDFs to whatever image
> format the OCR is expecting?  For example, using convert:
>
> $ convert f1040.pdf f1040.bmp
>
> and then load those bitmap files into the OCR.
>   
I replied quite a while ago with precise instructions on how to do this 
using convert or GIMP together with gocr.

From the archives (dated Sun, 03 Jun 2007 21:53:33 -0400):

There is one other option, but it's not pretty: you can use gocr 
(optical character recognition).

   1. Open the PDF in GIMP.
   2. Save it in PGM file format (or use ImageMagick:
      $ convert mydoc.pdf mydoc.pgm).
   3. Run gocr on the PGM file ($ gocr mydoc.pgm > mydoc.txt).
   4. Open mydoc.txt in OpenOffice.org (OO) and save it as an MS Word
      document.
   5. Read through the text and fix all the characters that were not
      recognized properly (which could be quite a lot).
   6. Re-format the document (bullets, underlines, bold, italic, etc.).
   7. Crop, save, and copy the embedded images from the PDF into the new
      DOC file.
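
For a PDF with more than one page, steps 2 and 3 above could be scripted
roughly as follows. This is only a sketch, not a tested recipe: the
function names txt_name and pdf_to_txt are made up for illustration, the
300 dpi density is an assumed value that may need tuning for your
documents, and it assumes ImageMagick's convert and gocr are both on the
PATH.

```shell
#!/bin/sh
# Sketch of steps 2 and 3: rasterize each PDF page to PGM, then OCR it.
# Assumes ImageMagick's `convert` and `gocr` are installed.

# Derive the output text file name from the PDF name (hypothetical helper).
txt_name() {
    printf '%s.txt\n' "${1%.pdf}"
}

# Convert one PDF to a single text file (hypothetical function name).
pdf_to_txt() {
    pdf=$1
    base=${pdf%.pdf}
    out=$(txt_name "$pdf")

    # Write one PGM per page (mydoc-000.pgm, mydoc-001.pgm, ...).
    # The 300 dpi density is an assumption, not a value from the thread.
    convert -density 300 "$pdf" "$base-%03d.pgm" || return 1

    # OCR the pages in order and collect all the text in one file.
    : > "$out"
    for page in "$base"-*.pgm; do
        gocr "$page" >> "$out" || return 1
    done
}
```

After sourcing the file, running "pdf_to_txt mydoc.pdf" should leave the
recognized text in mydoc.txt, ready for step 4. The zero-padded page
numbers keep the shell glob in page order.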

This is obviously a project, and probably more work than it's worth, but 
only you can decide that.

Good Luck

Larry


