AWS Machine Learning Blog

Amazon Textract recognizes handwriting and adds five new languages

Documents are a primary tool for communication, collaboration, record keeping, and transactions across industries, including financial, medical, legal, and real estate. The format of the data can make extraction harder, especially when the content is typed, handwritten, or embedded in a form or table. Furthermore, extracting data from documents by hand is error-prone, time-consuming, expensive, and does not scale. Amazon Textract is a machine learning (ML) service that extracts printed text and other data from documents, including the contents of tables and forms.

We’re pleased to announce two new features for Amazon Textract: support for handwriting in English documents, and expanded language support for extracting printed text from documents in Spanish, Portuguese, French, German, and Italian.

Handwriting recognition with Amazon Textract

Many documents, such as medical intake forms or employment applications, contain both handwritten and printed text. Our customers have asked for the ability to extract both. Amazon Textract can now extract printed text and handwriting from documents written in English with high confidence scores, whether it’s free-form text or text embedded in tables and forms. A single document can also contain a mix of typed and handwritten text.
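To make the mixed-content case concrete, here is a minimal Python sketch that separates handwritten words from printed ones in a Textract response. It assumes the response shape returned by the DetectDocumentText API, where each WORD block carries a TextType of HANDWRITING or PRINTED. The helper names and the sample response are hypothetical and written by hand for illustration; they are not real service output.

```python
from collections import defaultdict


def split_words_by_text_type(response):
    """Group the text of WORD blocks by their TextType value."""
    grouped = defaultdict(list)
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "WORD":
            continue
        # Fall back to "UNKNOWN" if a WORD block carries no TextType field.
        grouped[block.get("TextType", "UNKNOWN")].append(block["Text"])
    return dict(grouped)


def detect_text(path):
    """Send a local image to Amazon Textract (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above runs without the SDK

    client = boto3.client("textract")
    with open(path, "rb") as f:
        return client.detect_document_text(Document={"Bytes": f.read()})


# Hand-written sample shaped like a Textract response (not real service output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Name: Jane", "Confidence": 98.1},
        {"BlockType": "WORD", "Text": "Name:", "TextType": "PRINTED", "Confidence": 99.2},
        {"BlockType": "WORD", "Text": "Jane", "TextType": "HANDWRITING", "Confidence": 97.0},
    ]
}

print(split_words_by_text_type(sample_response))
```

With a real document, you would pass the result of detect_text("form.png") (a hypothetical file name) to split_words_by_text_type instead of the sample dictionary.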

The following image shows an example input document containing a mix of typed and handwritten text, and its converted output document.

You can log in to the Amazon Textract console to test out the handwriting feature, or check out the new demo by Amazon Machine Learning Hero Mike Chambers.

Not only can you upload documents with both printed text and handwriting, you can also use Amazon Augmented AI (Amazon A2I), which makes it easy to build workflows for a human review of the ML predictions. Adding in Amazon A2I can help you get to market faster by having your employees or AWS Marketplace contractors review the Amazon Textract output for sensitive workloads. For more information about implementing a human review, see Using Amazon Textract with Amazon Augmented AI for processing critical documents. If you want to use one of our AWS Partners, take a look at how Quantiphi is using handwriting recognition for their customers.
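As a rough illustration of how a human review step can be attached to an extraction call, the following Python sketch builds the arguments for the AnalyzeDocument API with a HumanLoopConfig. The flow definition ARN, account ID, and loop name are placeholders, and the helper names are hypothetical; creating the Amazon A2I flow definition itself is outside the scope of this sketch.

```python
def build_review_request(document_bytes, flow_definition_arn, human_loop_name):
    """Build keyword arguments for analyze_document with a human review loop."""
    return {
        "Document": {"Bytes": document_bytes},
        "FeatureTypes": ["FORMS"],
        "HumanLoopConfig": {
            "HumanLoopName": human_loop_name,
            "FlowDefinitionArn": flow_definition_arn,
        },
    }


def analyze_with_review(request):
    """Call Amazon Textract with the request (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above runs without the SDK

    client = boto3.client("textract")
    return client.analyze_document(**request)


# Placeholder values for illustration only; substitute your own resources.
example_request = build_review_request(
    document_bytes=b"example-image-bytes",
    flow_definition_arn=(
        "arn:aws:sagemaker:us-east-1:111122223333:flow-definition/example-flow"
    ),
    human_loop_name="intake-form-review-001",
)

print(sorted(example_request.keys()))
```

Keeping the request builder separate from the network call makes it easy to inspect or log what will be sent before any document leaves your environment.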

Additionally, we’re pleased to announce our language expansion. Customers can now extract and process documents in more languages.

New supported languages in Amazon Textract

Amazon Textract now supports processing printed documents in Spanish, German, Italian, French, and Portuguese. You can send documents in these languages, including forms and tables, for data and text extraction, and Amazon Textract automatically detects and extracts the information for you. You can simply upload the documents on the Amazon Textract console or send them using either the AWS Command Line Interface (AWS CLI) or AWS SDKs.
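For readers who prefer the SDK route, here is a minimal Python sketch using boto3. It assumes the AnalyzeDocument API with the TABLES and FORMS feature types and passes no language parameter, in line with the automatic detection described above. The helper names, the confidence threshold, and the Spanish sample response are hypothetical and written by hand for illustration.

```python
def lines_above_confidence(response, min_confidence=90.0):
    """Return the text of LINE blocks whose Confidence meets the threshold."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
        and block.get("Confidence", 0.0) >= min_confidence
    ]


def analyze_forms_and_tables(path):
    """Send a local image to Amazon Textract (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above runs without the SDK

    client = boto3.client("textract")
    with open(path, "rb") as f:
        return client.analyze_document(
            Document={"Bytes": f.read()},
            FeatureTypes=["TABLES", "FORMS"],
        )


# Hand-written sample shaped like a Textract response (not real service output).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Nombre: María", "Confidence": 99.0},
        {"BlockType": "LINE", "Text": "Dirección", "Confidence": 72.5},
    ]
}

print(lines_above_confidence(sample_response))
```

The threshold of 90 is an arbitrary choice for the example; lines that fall below it are natural candidates for a second look rather than for discarding.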

AWS customer success stories

AWS customers like you are always looking for ways to overcome document processing challenges. In this section, we share what our customers are saying about Amazon Textract.

Intuit

Intuit is a provider of innovative financial management solutions, including TurboTax and QuickBooks, to approximately 50 million customers worldwide.

“Intuit’s document understanding technology uses AI to eliminate manual data entry for our consumer, small business, and self-employed customers. For millions of Americans who rely on TurboTax every year, this technology simplifies tax filing by saving them from the tedious, time-consuming task of entering data from financial documents. Textract is an important element of Intuit’s document understanding capability, improving data extraction accuracy by analyzing text in the context of complex financial forms.”

– Krithika Swaminathan, VP of AI, Intuit

Veeva

Veeva helps cosmetics, consumer goods and chemical companies bring innovative, high quality products to market faster without compromising compliance.

“Our customers are processing millions of documents per year and have a critical need to extract the information stored within the documents to make meaningful business decisions. Many of our customers are multinational organizations which means the documents are submitted in various languages like Spanish or Portuguese. Our recent partnership with AWS allowed us early access to Amazon Textract’s new feature that supports additional languages like Spanish, and Portuguese. This partnership with Textract has been key to work closely, iterate and deliver exceptional solutions to our customers.”

– Ali Alemdar, Sr Product Manager, Veeva Industries

Baker Tilly

Baker Tilly is a leading advisory, tax and assurance firm dedicated to building long-lasting relationships and helping customers with their most pressing problems — and enabling them to create new opportunities.

“Across all industries, forms are one of the most popular ways of collecting data. Manual efforts can take hours or days to “read” through digital forms. Leveraging Amazon Textract’s Optical Character Recognition (OCR) technology we can now read through these digital forms quicker and effortlessly. We now leverage handwriting as part of Textract to parse out handwritten entities. This allows our customers to upload forms with both typed and handwritten text and improve their ability to make key decisions through data quickly and in a streamlined process. Additionally, Textract easily integrates with Amazon S3 and RDS for instantaneous access to processed forms and near real-time analytics.”

– Ollie East, Director of Advanced Analytics and Data Engineering, Baker Tilly

ARQ Group

ARQ Group is the leading end-to-end provider of digital solutions for the corporate and government market.

“At ARQ Group, we work with different transportation companies and their physical asset maintenance teams. Their processes have been refined over many years. Previous attempts to digitize the process caused too much disruption and consequently failed to be adopted. Textract allowed us to provide a hybrid solution to gain the benefits of predictive insights coming from digitizing maintenance data, whilst still allowing our customer workforce to continue following their preferred handwritten process. This is expected to result in a 22% reduction in downtime and 18% reduction in maintenance cost, as we can now predict when parts are likely to fail and schedule for maintenance to happen outside of production hours. We are also expecting the lifespan of our customer assets to increase, now that we are preventing failure scenarios.”

– Daniel Johnson, Business Segment Director, ARQ Group

Belle Fleur

Belle Fleur believes the ML revolution is altering the way we live, work, and relate to one another, and will transform the way every business in every industry operates.

“We use Amazon Textract to detect text for our clients that have the three Vs when it pertains to data: Variety, Velocity, and Volume, and particularly our clients that have different document formats to process information and data properly and efficiently. The feature designed to recognize the various different formats, whether it’s tables or forms and now with handwriting recognition, is an AI dream come true for our medical, legal, and commercial real estate clients. We are so excited to roll out this new handwriting feature to all of our customers to further enhance their current solution, especially those with lean teams. We are able to allow the machine learning to handle the heavy lifting via automation to read thousands of documents in a fraction of the time and allow their teams to focus on higher-order assignments.”

– Tia Dubuisson, President, Belle Fleur

Lumiq

Lumiq is a data analytics company, holding the deep domain and technical expertise to build and implement AI- and ML-driven products and solutions. Their data products are built like building blocks and run on AWS, which helps their customers scale the value of their data and drive tangible business outcomes.

“With thousands of documents being generated and received across different stages of the consumer engagement lifecycle every day, one of our customers (a leading insurance service provider in India) had to invest several manual hours for data entry, data QC, and validation. The document sets consisted of proposal forms, supporting documents for identity, financials, and medical reports, among others. These documents were in different, non-standardized formats and some of them were handwritten, resulting in an increased average lag in lead to policy issuance and impacted customer experience.

“We leveraged Amazon’s machine learning-powered Textract to extract information and insights from various types of documents, including handwritten text. Our custom solution built on top of Amazon Textract and other AWS services helped in achieving a 97% reduction in human labor for PII redaction and a projected 70% reduction in work hours for data entry. We are excited to further deep-dive into Textract to enable our customers with an E2E paperless workflow and enhance their end-consumer experience with significant time savings.”

– Mohammad Shoaib, Founder and CEO, Lumiq (Crisp Analytics)

QL Resources

QL is among ASEAN’s largest egg producers and surimi manufacturers, and is building a presence in the sustainable palm oil sector with activities including milling, plantations, and biomass clean energy.

“We have a large amount of handwritten documents that are generated daily in our factories, where it is challenging to ubiquitously install digital capturing devices. With the custom solution developed by our AWS partner Axrail using Amazon Textract and various AWS services, we are able to digitize documents for both printed and handwritten hard copy forms that we generated on the production floor daily, especially in production areas where digital capturing tools are not available or economical. This is a sensible solution and completes the missing link for full digitization of our production data.”

– Chia Lik Khai, Director, QL Resources

Summary

We continually make improvements to our products based on your feedback. We encourage you to log in to the Amazon Textract console, upload a sample document, and try the available APIs. You can also talk with your account manager about how best to incorporate these new features. Amazon Textract has many resources to help you get started, like blog posts, videos, partners, and getting started guides. Check out the Textract resources page for more information.

You have millions of documents, which means you have a ton of meaningful and critical data within those documents. You can extract and process your data in seconds rather than days, and keep it secure by using Amazon Textract. Get started today.

About the Author

Andrea Morton-Youmans is a Product Marketing Manager on the AI Services team at AWS. Over the past 10 years she has worked in the technology and telecommunications industries, focused on developer storytelling and marketing campaigns. In her spare time, she enjoys heading to the lake with her husband and Aussie dog Oakley, tasting wine and enjoying a movie from time to time.