Character Recognition Program that's Word-Compatible
Thread poster: BrianHayden
BrianHayden
United States
Russian to English
Jan 2, 2014

Is there any way I could scan the pages of a dictionary and then convert them into a (massive) Word file? If so, what would be the cheapest and simplest way?

 
Vadim Kadyrov
Ukraine
English to Russian
Yes, you can Jan 2, 2014

The best application, I believe, is ABBYY FineReader. You can use version 8, which should be much cheaper than the newest one. You just scan the pages into JPEG files and then use the application to OCR the images.

Still, this is an extremely time-consuming task. Even the best OCR applications won't be able to reproduce the layout of dictionary pages perfectly.


 
BrianHayden
United States
Russian to English
TOPIC STARTER
More detail... Jan 2, 2014

I should probably explain my plan better, feasible or not. I like Microsoft Word, and I think it's fairly straightforward to use. I've been keeping a dictionary of idioms as a Word file, adding new entries as I encounter new idioms. Keeping a dictionary of idioms and phrases in a Word file is especially convenient, since you can do a Ctrl + F search for a word within the phrase. That is easier than looking up each word of an idiom separately in a standard dictionary, which still may not list the idiom. I've recently found an especially good dictionary with a lot of idioms, and I wanted to scan it in and add it to the Word file somehow. Hand-typing the entries from the dictionary would be murderous. Anything less laborious than hand-typing is okay in my book.

And I forgot to mention that I need a program that can read Cyrillic. Since this is a dictionary, it also needs to read Cyrillic with accent marks. Does ABBYY FineReader do that? And is it user-friendly?

[Edited at 2014-01-02 08:38 GMT]



 
Rolf Keller
Germany
English to German
OCR needs know-how Jan 2, 2014

Vadim Kadyrov wrote:

You just scan pages into jpeg files and then use this application to OCR the images.


This is possible, but it must be done cautiously. JPEG files can be lossily compressed (and are, if you use the default settings), so the OCR results will not be optimal. BTW, any OCR application should be able to take scanner input directly, so there is no need to scan beforehand.

Even the best OCR applications won't be able to perfectly reproduce the layout of dictionary pages.


??? For the purpose mentioned, you probably don't want to reproduce the original layout but to produce a clean table (one table row per dictionary entry).

In the worst case, you have to mark up the columns manually in the OCR software and ignore everything else. Such markup takes about 30 seconds per page, so 240 pages take 2 hours. In many cases the OCR software will do that automatically, though.
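
The estimate above is simple arithmetic, and it can be redone for a dictionary of any size. A minimal sketch, assuming the 30 seconds per page quoted above; the function name is made up for illustration:

```python
def markup_hours(pages, seconds_per_page=30):
    """Return the manual column-markup time in hours."""
    return pages * seconds_per_page / 3600  # 3600 seconds per hour


# 240 pages * 30 s = 7200 s
print(markup_hours(240))  # prints 2.0
```

A 1,000-page dictionary at the same pace would come to a little over 8 hours.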

Depending on the dictionary you might have to write a Word macro that tidies up the resulting Word table. This might take one hour or one day.
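
The same kind of tidy-up can also be done outside Word, on the plain text exported from the OCR software. The following is only a rough sketch in Python, not a Word macro. It assumes that each entry starts on a line of the form "headword, dash, definition" and that any other non-empty line continues the previous definition. The separator pattern, the function names and the sample entries are all assumptions for illustration; a real dictionary will need its own rules.

```python
import re

# Assumed separator: an en dash, em dash, or hyphen with spaces on both sides.
SEPARATOR = re.compile(r"\s+[\u2013\u2014-]\s+")


def tidy_entries(ocr_lines):
    """Turn raw OCR lines into (headword, definition) rows.

    A line containing the separator starts a new entry; any other
    non-empty line is appended to the previous definition.
    """
    rows = []
    for raw in ocr_lines:
        line = " ".join(raw.split())  # collapse runs of whitespace
        if not line:
            continue
        parts = SEPARATOR.split(line, maxsplit=1)
        if len(parts) == 2:
            rows.append([parts[0], parts[1]])
        elif rows:
            rows[-1][1] += " " + line
    return [tuple(row) for row in rows]


def to_tsv(rows):
    """Join rows into tab-separated text, one entry per line."""
    return "\n".join(f"{head}\t{definition}" for head, definition in rows)


# Made-up sample lines, not taken from any particular dictionary.
sample = [
    "бить баклуши  \u2013  to twiddle",
    "one's thumbs",
    "",
    "как рыба в воде \u2013 in one's element",
]
print(to_tsv(tidy_entries(sample)))
```

Tab-separated text pasted into Word can be turned into a table with Word's convert-text-to-table function, giving one row per dictionary entry.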


 
Vadim Kadyrov
Ukraine
English to Russian
The thing I suggested Jan 2, 2014


What I suggested is a general scenario, with the details to be discussed (or suggested) later on. When I saw the topic starter's message, I assumed he wanted to reproduce the hard copy of the dictionary in electronic form (say, some old and really precious edition).

If he only wants some entries from this dictionary digitized, the task becomes much easier, of course.

A few words about JPEG images: if the resolution is high, I believe the quality issues of this file type no longer matter.

But these are details. I think the topic starter has already seen the "path".


 
esperantisto
Member (2006)
English to Russian
SITE LOCALIZER
No Jan 2, 2014

BrianHayden wrote:

Keeping a dictionary of idioms and phrases in a Word file is especially convenient, since you can do a Ctrl + F search


If you use one dictionary, this may be fine. However, a translator normally needs more than one dictionary. In such a case, using a dictionary shell is a better solution. My favorite is GoldenDict.

program that can read Cyrillic with accent marks. Does ABBYY FineReader do that?


No, FineReader can't produce good output for accented Cyrillic letters. Versions 8 and 9 simply produce unaccented letters; the later versions 10 and 11 produce recognition errors.

[Edited at 2014-01-02 11:38 GMT]


 
BrianHayden
United States
Russian to English
TOPIC STARTER
Dictionary Shell? Jan 2, 2014


What is a dictionary shell?


 
BrianHayden
United States
Russian to English
TOPIC STARTER
Accent marks... Jan 2, 2014

esperantisto wrote:

No, FineReader can't produce good output for accented Cyrillic letters. The versions 8 or 9 simply produce unaccented letters, the later 10 and 11 produce recognition errors.

Is there any way around that? It seems that a product that complicated would have some sort of way of dealing with that, especially since in Russian accent marks are occasionally used to disambiguate words in everyday, non-dictionary texts (think of за́мок, замо́к).


 
esperantisto
Member (2006)
English to Russian
SITE LOCALIZER
Answers Jan 3, 2014

BrianHayden wrote:

What is a dictionary shell?


Well, a dictionary program. A program used to access dictionaries.

BrianHayden wrote:

Is there any way around that? It seems that a product that complicated would have some sort of way of dealing with that, especially since in Russian accent marks are occasionally used to disambiguate words in everyday, non-dictionary texts (think of за́мок, замо́к).


No idea. FineReader can be trained to recognize specific languages with specific characters, but I don’t know if it’s applicable to Russian accents as there are no pre-composed accented Cyrillic letters in Unicode.
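
The Unicode point can be checked with Python's standard `unicodedata` module. This is a minimal sketch and says nothing about how FineReader works internally; the `strip_stress` helper is made up for illustration. It shows that a Latin "a" plus a combining acute accent (U+0301) composes into a single code point under NFC, while a Cyrillic "а" plus the same accent stays as two. It also shows one way to remove stress marks so that a plain search finds a word whether or not it is accented.

```python
import unicodedata

COMBINING_ACUTE = "\u0301"


def strip_stress(text):
    """Remove combining acute accents; other marks (й, ё) are kept."""
    decomposed = unicodedata.normalize("NFD", text)
    without_stress = decomposed.replace(COMBINING_ACUTE, "")
    return unicodedata.normalize("NFC", without_stress)


# Latin a + acute composes into one code point (U+00E1).
latin = unicodedata.normalize("NFC", "a" + COMBINING_ACUTE)
# Cyrillic а (U+0430) + acute has no composed form and stays as two.
cyrillic = unicodedata.normalize("NFC", "\u0430" + COMBINING_ACUTE)
print(len(latin), len(cyrillic))  # prints 1 2

# Both stressed spellings reduce to the same unstressed word.
print(strip_stress("за\u0301мок") == strip_stress("замо\u0301к"))  # prints True
```

Note that this helper strips every acute accent, including those on Latin letters, so it is only meant for Russian text.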


 
Emma Goldsmith
Spain
Member (2004)
Spanish to English
Russian is in the drop-down list of languages in Abbyy Jan 3, 2014

esperantisto wrote:

No idea. FineReader can be trained to recognize specific languages with specific characters, but I don’t know if it’s applicable to Russian accents as there are no pre-composed accented Cyrillic letters in Unicode.


I've got no idea either, but Russian is definitely included in the list of languages that ABBYY FineReader will recognise (version 11.0).

You can also add a host of symbols/letters as a "user language". For example, I've added µ, α and β because FineReader doesn't recognise them out of the box.


 

