Suggestion: OCR plugin wrapper for open-source Tesseract-OCR
Description
Activity
endolith May 24, 2014 at 1:18 AM
This is amazing, thanks. I wish it could be built in and easier to set up though.
Could also use this to copy to clipboard: http://superuser.com/a/231032
Robin Krom February 6, 2013 at 6:51 PM
*Milestone*: Next_Release --> None
Wikinaut November 28, 2012 at 6:42 AM
The script in my previous post writes its output text to
C:\WINDOWS\TEMP\ocr.txt
Wikinaut November 28, 2012 at 6:41 AM
I posted code in post #7 of https://sourceforge.net/projects/greenshot/forums/forum/676082/topic/5574624/index/page/1
Here is a working external command file:
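@ECHO OFF
REM OCR 20121128
REM Usage: ocr image [language]; the language defaults to deu
IF "%~1"=="" GOTO HELP
SETLOCAL
SET "LANG=%~2"
IF "%~2"=="" SET "LANG=deu"
SET "TMPFILE=%TMP%\ocr.tiff"
ECHO OCRing %~1 (%LANG%) ==^> %TMP%\ocr.txt
REM Upscale 4x and convert to an uncompressed grayscale TIFF
C:\Programme\ImageMagick\convert -resize "400%%" -type Grayscale +compress "%~1" "%TMPFILE%"
REM Tesseract appends .txt to the output base name
C:\Programme\tesseract-OCR\tesseract "%TMPFILE%" "%TMP%\ocr" -l %LANG%
REM Print the recognized text from the temp folder, not the current directory
TYPE "%TMP%\ocr.txt"
GOTO EOF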
:HELP
@ECHO:
@ECHO OCR image
@ECHO:
@ECHO Usage^: ocr x.jpg [deu^|eng]
@ECHO: default deu
@ECHO:
:EOF
Wikinaut November 21, 2012 at 8:31 PM
By the way, MODI is really good, but it is not free and cannot be installed on every machine. This is why I looked for a free alternative. Together with Greenshot, these two are perhaps good twins...
Greenshot 1.0 comes with a plugin (wrapper) for Microsoft Office Document Imaging (MODI) OCR, which works only if MODI is available on the machine. MODI, however, requires a Microsoft Office license and at least a partial installation of the required modules and languages. Often, not all of the required language packs are available. Not everyone likes MODI, even though it has high detection quality.
Suggestion:
=========
* implement an alternative plugin as a wrapper for the open-source OCR command-line software "Tesseract"
* it also requires ImageMagick (convert) to preprocess images (resizing by +300%, conversion to TIFF)
* advantage: almost any language is available
* advantage: open-source
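
To illustrate what such a wrapper would have to do, here is a rough sketch in Python of the same two steps as the batch file in the comments: preprocess with ImageMagick, then run Tesseract. This is not Greenshot plugin code. The function names build_ocr_commands and ocr_image are made up for this example, and it assumes that executables named convert and tesseract are on the PATH (on Windows, full paths may be needed, since a system tool is also called convert).

```python
import subprocess
import tempfile
from pathlib import Path


def build_ocr_commands(image, lang="deu", tmp_dir=None,
                       convert_exe="convert", tesseract_exe="tesseract"):
    """Return (convert_cmd, tesseract_cmd, text_path) for one image.

    Hypothetical helper: the executable names are assumptions and may
    need to be replaced by full paths on a given machine.
    """
    tmp = Path(tmp_dir) if tmp_dir else Path(tempfile.gettempdir())
    tiff = tmp / "ocr.tiff"
    out_base = tmp / "ocr"  # Tesseract appends ".txt" to this base name

    # Step 1: upscale to 400% (i.e. +300%) and write an uncompressed
    # grayscale TIFF, as in the batch file.
    convert_cmd = [convert_exe, str(image),
                   "-resize", "400%", "-type", "Grayscale", "+compress",
                   str(tiff)]

    # Step 2: run Tesseract on the TIFF with the chosen language.
    tesseract_cmd = [tesseract_exe, str(tiff), str(out_base), "-l", lang]

    return convert_cmd, tesseract_cmd, out_base.with_suffix(".txt")


def ocr_image(image, lang="deu"):
    """Run both tools and return the recognized text."""
    convert_cmd, tesseract_cmd, text_path = build_ocr_commands(image, lang)
    subprocess.run(convert_cmd, check=True)    # raises if convert fails
    subprocess.run(tesseract_cmd, check=True)  # raises if tesseract fails
    return text_path.read_text(encoding="utf-8")
```

With both tools installed, ocr_image("screenshot.png", "eng") would return the recognized text as a string. Keeping the command construction separate from the subprocess calls makes it possible to check the arguments without having either tool installed.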
References:
* https://code.google.com/p/tesseract-ocr/
* https://de.wikipedia.org/wiki/Tesseract_%28Software%29
* https://en.wikipedia.org/wiki/Tesseract_%28software%29