How to Run Tesseract OCR with pytesseract in Lambda Container Images
You can run Tesseract OCR with pytesseract on AWS Lambda by packaging the function as a container image.
The example code used in this post is available in my GitHub repository.
Prerequisites
Install the following on your computer.
- AWS SAM
- Python 3.x
Creating a SAM Application
Directory Structure
/
|-- src/
|   |-- Dockerfile
|   |-- __init__.py
|   |-- app.py
|   |-- requirements.txt
|   `-- run-melos.pdf
|-- README.md
|-- __init__.py
|-- requirements.txt
`-- template.yaml
AWS SAM Template
The example below uses EventBridge as the Lambda trigger because the sample Python script runs for more than 2 minutes, while API Gateway caps the integration timeout at 29 seconds. The API Gateway documentation puts it this way:
50 milliseconds - 29 seconds for all integration types, including Lambda, Lambda proxy, HTTP, HTTP proxy, and AWS integrations.
AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Tesseract OCR Sample with AWS Lambda Container Images using AWS SAM

Resources:
  TesseractOcrSample:
    Type: AWS::Serverless::Function
    Properties:
      Events:
        Schedule:
          Type: Schedule
          Properties:
            Enabled: true
            Schedule: cron(0 * * * ? *)
      MemorySize: 512
      PackageType: Image
      Timeout: 900
    Metadata:
      DockerTag: latest
      DockerContext: ./src/
      Dockerfile: Dockerfile
Dockerfile
Create a Dockerfile with the following content. If you intend to use a non-English language such as Japanese, add ENV LANG=ja_JP.UTF-8; otherwise you will see garbled text in the Docker standard output.
FROM public.ecr.aws/lambda/python:3.9

# Prevent garbled Japanese text in the Docker standard output.
ENV LANG=ja_JP.UTF-8

WORKDIR ${LAMBDA_TASK_ROOT}

COPY app.py ./
COPY requirements.txt ./
COPY run-melos.pdf ./

# Enable EPEL, install poppler-utils (required by pdf2image) and Tesseract
# with the Japanese language pack, then install the Python dependencies
# into the Lambda task root.
RUN rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm \
    && yum update -y && yum install -y poppler-utils tesseract tesseract-langpack-jpn \
    && pip install -U pip && pip install -r requirements.txt --target "${LAMBDA_TASK_ROOT}"

CMD ["app.lambda_handler"]
Python Script
Create requirements.txt and install the dependencies with pip install -r requirements.txt.
pdf2image==1.16.0
pytesseract==0.3.9
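Before wiring everything into Lambda, you can optionally sanity-check the two libraries on your own machine. This sketch assumes Tesseract (with the Japanese language data) and Poppler are installed locally, e.g. via your OS package manager, and that you run it from the project root.

# local_check.py - optional sketch to verify pdf2image and pytesseract
# outside of Docker. Requires local installs of poppler and tesseract
# (with the jpn language data).
import pdf2image
import pytesseract

# Convert only the first page of the bundled PDF to a PIL image.
images = pdf2image.convert_from_path('src/run-melos.pdf', fmt='png',
                                     first_page=1, last_page=1)

# OCR the page with the Japanese model and print the raw text.
print(pytesseract.image_to_string(images[0], lang='jpn'))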
Create app.py with the following code.
import re
from datetime import datetime
from typing import Optional

import pdf2image
import pytesseract


def lambda_handler(event: dict, context: dict) -> None:
    start = datetime.now()
    result = ''
    images = to_images('run-melos.pdf', 1, 2)
    for image in images:
        result += to_string(image)
    result = normalize(result)
    end = datetime.now()
    duration = end.timestamp() - start.timestamp()
    print('----------------------------------------')
    print(f'Start: {start}')
    print(f'End: {end}')
    print(f'Duration: {int(duration)} seconds')
    print(f'Result: {result}')
    print('----------------------------------------')


def to_images(pdf_path: str, first_page: Optional[int] = None,
              last_page: Optional[int] = None) -> list:
    """Convert a PDF to PNG images.

    Args:
        pdf_path (str): PDF path
        first_page (int): First page (starting at 1) to be converted
        last_page (int): Last page to be converted

    Returns:
        list: List of image data
    """
    print(f'Converting a PDF ({pdf_path}) to PNG...')
    images = pdf2image.convert_from_path(
        pdf_path=pdf_path,
        fmt='png',
        first_page=first_page,
        last_page=last_page,
    )
    print(f'Converted {len(images)} PNG image(s).')
    return images


def to_string(image) -> str:
    """OCR an image.

    Args:
        image: Image data

    Returns:
        str: OCR-processed characters
    """
    print('Extracting characters from an image...')
    return pytesseract.image_to_string(image, lang='jpn')


def normalize(target: str) -> str:
    """Normalize the OCR result.

    Applies the following:
    - Remove newlines.
    - Remove spaces between Japanese characters.

    Args:
        target (str): Target text to be normalized

    Returns:
        str: Normalized text
    """
    result = target.replace('\n', '')
    result = re.sub(r'([あ-んア-ン一-鿐])\s+(?=[あ-んア-ン一-鿐])', r'\1', result)
    return result
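To see what normalize does, here is a quick illustration with a made-up string imitating raw Tesseract output, which often contains spurious spaces and line breaks between Japanese characters:

# Example input imitating raw OCR output with stray spaces and newlines.
raw = '走 れ\nメ ロ ス\n太宰 治'
print(normalize(raw))  # -> 走れメロス太宰治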
Build
Build by running sam build.
$ sam build
...
Build Succeeded
Built Artifacts : .aws-sam/build
Built Template : .aws-sam/build/template.yaml
Commands you can use next
=========================
[*] Validate SAM template: sam validate
[*] Invoke Function: sam local invoke
[*] Test Function in the Cloud: sam sync --stack-name {stack-name} --watch
[*] Deploy: sam deploy --guided
To run this script in your local environment, run sam local invoke.
$ sam local invoke
...
START RequestId: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx Version: $LATEST
Converting a PDF (run-melos.pdf) to PNG...
Converted 2 PNG image(s).
Extracting characters from an image...
Extracting characters from an image...
----------------------------------------
Start: 2022-06-19 17:37:36.001748
End: 2022-06-19 17:40:18.842054
Duration: 162 seconds
Result: PDD図書館管理番号 000.000002ー800 走れメロス太宰治=作メロスは激怒した。
...
----------------------------------------
END RequestId: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
REPORT RequestId: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx Init Duration: 1.09 ms Duration: 163525.15 ms Billed Duration: 163526 ms Memory Size: 512 MB Max Memory Used: 512 MB
Note that Max Memory Used has hit the configured 512 MB. Because Lambda allocates CPU in proportion to memory, raising MemorySize in the template will likely shorten the roughly 160-second run considerably.
Deploy
If you do not have an ECR repository, create one with the following command.
aws ecr create-repository --repository-name tesseract-ocr-lambda
Replace the --image-repository value with your own ECR repository URI, and deploy the application with the following command.
$ sam deploy \
--stack-name aws-lambda-tesseract-ocr-sample \
--image-repository 123456789012.dkr.ecr.ap-northeast-1.amazonaws.com/tesseract-ocr-lambda \
--capabilities CAPABILITY_IAM
...
Successfully created/updated stack - aws-lambda-tesseract-ocr-sample in None
After deployment, the Lambda function runs every hour, as specified by the cron(0 * * * ? *) schedule, and the OCR results are written to CloudWatch Logs.
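To check the results from your terminal, you can use sam logs, or fetch them with boto3 as in the sketch below. The log group name is a placeholder; Lambda log groups follow the convention /aws/lambda/&lt;function-name&gt;, so substitute the name of your deployed function.

# fetch_logs.py - a minimal sketch for pulling recent OCR output from
# CloudWatch Logs with boto3.
import boto3

logs = boto3.client('logs')
response = logs.filter_log_events(
    logGroupName='/aws/lambda/<your-function-name>',  # placeholder
    filterPattern='Result',  # the handler prints lines like "Result: ..."
    limit=10,
)
for event in response['events']:
    print(event['message'], end='')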
Cleaning Up
Clean up the provisioned AWS resources with the following command.
sam delete --stack-name aws-lambda-tesseract-ocr-sample
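sam delete removes the stack, but not the ECR repository created earlier. If you also want to remove the repository, you can do so with aws ecr delete-repository, or with boto3 as in this sketch:

# cleanup_ecr.py - optional sketch to delete the ECR repository created
# for this sample. force=True also deletes any images it still contains.
import boto3

ecr = boto3.client('ecr')
ecr.delete_repository(repositoryName='tesseract-ocr-lambda', force=True)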
Conclusion
With container image support, AWS Lambda can take advantage of the many Docker images and system packages published by users and enterprises around the world.
I hope you will find this post useful.