Building a tool to OCR and translate scanned PDFs without losing the formatting
Thread poster: Kyle Corbitt
Kyle Corbitt United States Local time: 19:29 Spanish to English + ...
May 30, 2023
Hi everyone, I've started building a system that combines OCR and MT to quickly produce a draft translation of scanned images and PDFs. It keeps all the formatting of the original document and just adds editable text boxes on top, which saves a ton of time on prep/formatting. It's particularly useful for simple forms like birth certificates (it doesn't work well yet for documents with longer prose). The URL is https://translato.ai
My wife and I actually built this because we wanted a tool like this for ourselves but couldn't find one. We had to manually translate her birth certificate and other documentation when we moved to the US, and I was surprised that there was no way to do that conveniently.
I initially planned for the tool to be used by individuals, but I've actually shown it to a few professional translators and they mentioned that there wasn't a good tool for translating scanned documents for professionals either, so I decided to share it here as well.
I'd really appreciate any feedback on whether this helps your workflow. The service is totally free.
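For anyone curious about the general idea (OCR, then MT, then text boxes placed over the original layout), here is a minimal sketch. It is not the tool's actual code, and everything in it is an assumption made for illustration: the `TextBox` fields, the `translate_boxes` helper, the toy glossary standing in for an MT engine, and the hard-coded "OCR output". The point it shows is that each recognized region keeps its position and size, and only its text is swapped for the translation.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass(frozen=True)
class TextBox:
    """One recognized text region: position and size on the page, plus its text."""
    x: float
    y: float
    width: float
    height: float
    text: str


def translate_boxes(boxes: List[TextBox], translate: Callable[[str], str]) -> List[TextBox]:
    """Return new boxes with translated text and unchanged geometry."""
    return [replace(box, text=translate(box.text)) for box in boxes]


# Hypothetical stand-in for an MT engine: a tiny Spanish-to-English glossary.
GLOSSARY = {"Nombre": "Name", "Fecha de nacimiento": "Date of birth"}


def toy_translate(text: str) -> str:
    """Look the text up in the glossary; leave unknown text untouched."""
    return GLOSSARY.get(text, text)


# Hypothetical OCR output for a simple form with two labels.
ocr_boxes = [
    TextBox(40, 100, 120, 18, "Nombre"),
    TextBox(40, 140, 200, 18, "Fecha de nacimiento"),
]

translated = translate_boxes(ocr_boxes, toy_translate)
for box in translated:
    print(box)
```

Because the geometry is never modified, the translated boxes can be drawn over the scanned image at the same coordinates, and a user can still edit, move, or resize each box afterwards.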
Stepan Konev Russian Federation Local time: 05:29 English to Russian
PDF output format
May 31, 2023
I have tested your tool. Thank you for your effort and work. However, I doubt I can find a use for it. I put in a non-editable JPG file and I get a non-editable PDF file back. Any machine translation requires post-editing. That means I have to OCR the output PDF to make it editable.
Samuel Murray Netherlands Local time: 04:29 Member (2006) English to Afrikaans + ...
@Kyle
May 31, 2023
I have tested your tool and it works well for its intended purpose. It takes a bit of experimentation to learn all of its features, because it does not always work the way users might expect. For example, some might expect to be able to download an editable file. When I tested it, I selected English to English as the language combination, so that no machine translation would be inserted into the segments.
Kyle Corbitt United States Local time: 19:29 Spanish to English + ...
TOPIC STARTER
editing
May 31, 2023
Stepan Konev wrote:
I have tested your tool. Thank you for your effort and work. However, I doubt I can find a use for it. I put in a non-editable JPG file and I get a non-editable PDF file back. Any machine translation requires post-editing. That means I have to OCR the output PDF to make it editable.
Hi Stepan, thanks so much for your feedback!
The intention is to do all editing in the tool itself. Once your file has been imported, you'll see an interface where all the text is editable. You can click on any of the text boxes to edit, move, or resize them. Once you're satisfied, you can export the final version as a PDF.
That said, I understand you may have a workflow where it's more convenient to export in an editable format and do further post-processing that way. Is there a particular export format that would be most convenient and useful for you?
Kyle Corbitt United States Local time: 19:29 Spanish to English + ...
TOPIC STARTER
English to English
May 31, 2023
Samuel Murray wrote:
I have tested your tool and it works well for its intended purpose. It takes a bit of experimentation to learn all of its features, because it does not always work the way users might expect. For example, some might expect to be able to download an editable file. When I tested it, I selected English to English as the language combination, so that no machine translation would be inserted into the segments.
Hi Samuel, thanks so much for your feedback! I assume you selected English to English because you intend to use it mostly for the OCR capabilities, not for the MT pass? What type of document did you use for your test, and did it have any trouble identifying the text and making it editable?