Performing OCR on Japanese PDFs Using Tesseract and Pytesseract

Takahiro Iwasa

This example demonstrates how to perform OCR on Japanese PDFs using Tesseract OCR v4 and pytesseract.

The source text is Run, Melos! by Osamu Dazai, a work that is now in the public domain.

Requirements

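Before we start, ensure the Python libraries imported by the script, pdf2image and pytesseract, are installed. The Dockerfile below installs them with pip from requirements.txt, so list both packages there.

Additionally, Tesseract OCR itself must be installed, along with Poppler, which pdf2image uses to render PDF pages. In this note, both are set up using Docker. Please refer to the instructions on the official repository for more details.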
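
When working outside Docker, a quick pre-flight check along the following lines may help confirm that everything is in place. This is only a sketch: the helper name missing_requirements is hypothetical, and looking for the tesseract and pdftoppm executables on PATH is an assumption about how Tesseract and Poppler are installed locally.

```python
import importlib.util
import shutil


def missing_requirements(
    modules=('pdf2image', 'pytesseract'),
    commands=('tesseract', 'pdftoppm'),
) -> dict:
    """Return the Python modules and command-line tools that cannot be found.

    The default names are assumptions for this article's setup:
    pdftoppm is one of the Poppler command-line tools.
    """
    # find_spec() returns None when a top-level module is not importable.
    missing_modules = [m for m in modules if importlib.util.find_spec(m) is None]
    # shutil.which() returns None when an executable is not on PATH.
    missing_commands = [c for c in commands if shutil.which(c) is None]
    return {'modules': missing_modules, 'commands': missing_commands}


if __name__ == '__main__':
    print(missing_requirements())
```

If both lists in the returned dictionary are empty, the script's dependencies were found. The check does not verify that the Japanese language data for Tesseract is installed.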

Building

Writing Python Script

app.py
import re
from datetime import datetime
from typing import Optional

import pdf2image
import pytesseract


def to_images(pdf_path: str, first_page: Optional[int] = None, last_page: Optional[int] = None) -> list:
    """ Convert PDF pages to PNG images.

    Args:
        pdf_path (str): PDF path
        first_page (int): First page to be converted (1-based)
        last_page (int): Last page to be converted
    Returns:
        list: List of image data
    """
    print(f'Convert a PDF ({pdf_path}) to PNG images...')
    images = pdf2image.convert_from_path(
        pdf_path=pdf_path,
        fmt='png',
        first_page=first_page,
        last_page=last_page,
    )
    print(f'Converted {len(images)} PNG image(s).')
    return images


def to_string(image) -> str:
    """ Run OCR on image data.

    Args:
        image: Image data
    Returns:
        str: Characters recognized by OCR
    """
    print('Extract characters from an image...')
    return pytesseract.image_to_string(image, lang='jpn')


def normalize(target: str) -> str:
    """ Normalize result text.

    Applying the following:
    - Remove new lines.
    - Remove spaces between Japanese characters.

    Args:
        target (str): Target text to be normalized
    Returns:
        str: Normalized text
    """
    result = re.sub(r'\n', '', target)
    # Raw strings keep \s and \1 from being treated as string escapes.
    result = re.sub(r'([あ-んア-ン一-鿐])\s+((?=[あ-んア-ン一-鿐]))', r'\1\2', result)
    return result


def save(result: str) -> str:
    """ Save the result text in a text file.

    Args:
        result (str): Result text
    Returns:
        str: Text file path
    """
    path = 'result.txt'
    # Write UTF-8 explicitly so Japanese text does not depend on the locale.
    with open(path, 'w', encoding='utf-8') as f:
        f.write(result)
    return path


def main() -> None:
    start = datetime.now()
    result = ''
    # Only pages 1 to 2 of the PDF are processed.
    images = to_images('run-melos.pdf', 1, 2)
    for image in images:
        result += to_string(image)
    result = normalize(result)
    path = save(result)
    end = datetime.now()
    duration = (end - start).total_seconds()
    print('----------------------------------------')
    print(f'Start: {start}')
    print(f'End: {end}')
    print(f'Duration: {int(duration)} seconds')
    print(f'Result file path: {path}')
    print('----------------------------------------')


if __name__ == '__main__':
    main()
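
To see what the normalization step does, the following standalone sketch applies the same character ranges as normalize() to a short sample. The sample string and the names JP and GAP are made up for illustration; the regular expression is written with a plain lookahead, which is assumed to behave the same as the pattern in app.py.

```python
import re

# Same character ranges as normalize() in app.py: hiragana, katakana, kanji.
JP = 'あ-んア-ン一-鿐'
# A Japanese character, then whitespace, then (lookahead) another Japanese character.
GAP = re.compile(rf'([{JP}])\s+(?=[{JP}])')


def normalize(target: str) -> str:
    """Remove new lines, then remove whitespace between Japanese characters."""
    result = target.replace('\n', '')
    return GAP.sub(r'\1', result)


# Hypothetical OCR output with stray spaces between Japanese characters.
sample = 'メロス は 激怒 した。\nHello World'
print(normalize(sample))  # メロスは激怒した。Hello World
```

Spaces between Latin words are left alone because neither neighbor falls in the Japanese ranges. Characters outside those ranges, such as the punctuation mark 。, are not treated as Japanese, so whitespace next to them is kept.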

Creating Dockerfile

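The Dockerfile copies run-melos.pdf into the image (the COPY run-melos.pdf ./ instruction), so the PDF must be in the same directory as the Dockerfile when building.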

Dockerfile
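FROM python:3.10

WORKDIR /usr/src/app
COPY app.py ./
COPY requirements.txt ./
COPY run-melos.pdf ./

# poppler-utils is required by pdf2image; tesseract-ocr-jpn adds the Japanese language data.
RUN apt-get update && apt-get install -y poppler-utils tesseract-ocr tesseract-ocr-jpn \
    && pip install -r requirements.txt

CMD ["python", "app.py"]
# For debugging, keep the container running instead of executing the script:
# CMD ["/bin/sh", "-c", "while :; do sleep 10; done"]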

Testing

The Python script executes the following:

  1. Convert the PDF to PNG image data using pdf2image.
  2. Extract characters from the converted data using Tesseract OCR and pytesseract.
  3. Save the result in a text file.

Run the following commands:

Terminal window
NAME=pytesseract-sample
docker build -t $NAME .
docker run --name $NAME $NAME
# Copy `result.txt` out of the container to view the OCR result.
docker cp $NAME:/usr/src/app/result.txt ./
less result.txt
# Clean up
docker container rm $NAME
docker image rm $NAME
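
After copying result.txt out of the container, a rough sanity check such as the one below can indicate whether the OCR produced mostly Japanese text. This is a sketch under stated assumptions: the function name japanese_ratio and the Unicode ranges used are illustrative choices, not part of the original script.

```python
import re
from pathlib import Path

# Hiragana, katakana, and common kanji blocks (assumed ranges for this check only).
JAPANESE_CHAR = re.compile(r'[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]')


def japanese_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters that are Japanese."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if JAPANESE_CHAR.match(c))
    return hits / len(chars)


if __name__ == '__main__':
    path = Path('result.txt')
    if path.exists():
        text = path.read_text(encoding='utf-8')
        print(f'{len(text)} characters, {japanese_ratio(text):.0%} Japanese')
```

A very low ratio would suggest that the Japanese language data was not used or that the page images were not recognized well. The ratio does not measure recognition accuracy.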

Result

The result can be downloaded from:

Takahiro Iwasa

Software Developer
Involved in the requirements definition, design, and development of cloud-native applications using AWS. Japan AWS Top Engineers 2020-2023.