[Linux] converting PDF to DOC?
Larry Kagan
linux@flux.org
Fri, 22 Jun 2007 15:13:44 -0400
Robert Citek wrote:
> On 06/22/2007 10:32 AM, Adam Glass wrote:
>
>> You could print the PDF and then scan the pages and then run the OCR
>> software.
>>
>
> Would it be possible to directly convert the PDFs to whatever image
> format the OCR is expecting? For example, using convert:
>
> $ convert f1040.pdf f1040.bmp
>
> and then load those bitmap files into the OCR.
>
I replied quite a while ago with precise instructions on how to do this
using convert or Gimp and gocr.
From the archives (dated: /Sun, 03 Jun 2007 21:53:33 -0400/)
There is one other option but it's not pretty. You can use gocr
(optical character recognition).
1. Open the PDF in Gimp.
2. Save as PGM file format. (or use ImageMagick: $ convert mydoc.pdf
mydoc.pgm)
3. Run gocr on the PGM file ($ gocr mydoc.pgm > mydoc.txt)
4. Open mydoc.txt in OpenOffice.org and save it as an MS Word document.
5. Read and fix all the characters not recognized properly (which
could be quite a lot)
6. Re-format the document (bullets, underlines, bold, italic, etc.)
7. Crop, save, and copy embedded images from the PDF into the new doc file.
This is obviously a project and probably more work than it's worth, but
only you can decide that.
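For anyone who wants to script the convert and gocr steps instead of going through Gimp, here is a rough sketch. The function name pdf_to_text, the 300 dpi density, and the one-file-per-page naming are my assumptions, not part of the original instructions, so adjust them to taste.

```shell
# Sketch only: render each PDF page to a PGM image, then OCR every page.
# pdf_to_text is a hypothetical helper name, not part of ImageMagick or gocr.
pdf_to_text() {
    pdf=$1
    base=${pdf%.pdf}

    # %03d asks convert to write one numbered file per page
    # (mydoc-000.pgm, mydoc-001.pgm, ...). The 300 dpi density is an
    # assumed value; the default rendering is often too coarse for OCR.
    convert -density 300 "$pdf" "$base-%03d.pgm" || return 1

    # Run gocr on each page in order and collect the text in one file.
    : > "$base.txt"
    for page in "$base"-[0-9][0-9][0-9].pgm; do
        [ -f "$page" ] || continue
        gocr "$page" >> "$base.txt" || return 1
    done
}

# Usage: pdf_to_text mydoc.pdf   (writes mydoc.txt next to the PDF)
```

This only covers steps 2 and 3. The proofreading, re-formatting, and image copying still have to be done by hand.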
Good Luck
Larry