Performing OCR on Japanese PDFs Using Tesseract and Pytesseract

This example demonstrates how to perform OCR on Japanese PDFs using Tesseract OCR v4 and pytesseract.
The source text is Run, Melos! by Osamu Dazai, a work that is now in the public domain.
Requirements
Before we start, ensure that the pdf2image and pytesseract libraries are installed.
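A minimal requirements.txt for this note might look like the following (just the two libraries the script imports; no version pins are given in the original):

```
pdf2image
pytesseract
```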
Additionally, Tesseract OCR itself must be installed. In this note it is set up using Docker; please refer to the instructions in the official repository for more details.
Building
Writing the Python Script
```python
import re
from datetime import datetime

import pdf2image
import pytesseract


def to_images(pdf_path: str, first_page: int = None, last_page: int = None) -> list:
    """
    Convert a PDF to PNG images.

    Args:
        pdf_path (str): PDF path
        first_page (int): First page (starting at 1) to be converted
        last_page (int): Last page to be converted

    Returns:
        list: List of image data
    """
    print(f'Convert a PDF ({pdf_path}) to png images...')
    images = pdf2image.convert_from_path(
        pdf_path=pdf_path,
        fmt='png',
        first_page=first_page,
        last_page=last_page,
    )
    print(f'The total number of converted png images is {len(images)}.')
    return images


def to_string(image) -> str:
    """
    OCR an image.

    Args:
        image: Image data

    Returns:
        str: OCR-extracted characters
    """
    print('Extract characters from an image...')
    return pytesseract.image_to_string(image, lang='jpn')


def normalize(target: str) -> str:
    """
    Normalize the result text.

    Applies the following:
    - Remove newlines.
    - Remove spaces between Japanese characters.

    Args:
        target (str): Target text to be normalized

    Returns:
        str: Normalized text
    """
    result = re.sub(r'\n', '', target)
    # Drop whitespace sandwiched between two Japanese characters; the lookahead
    # keeps the second character available for the next match.
    result = re.sub(r'([あ-んア-ン一-鿐])\s+(?=[あ-んア-ン一-鿐])', r'\1', result)
    return result


def save(result: str) -> str:
    """
    Save the result text to a text file.

    Args:
        result (str): Result text

    Returns:
        str: Text file path
    """
    path = 'result.txt'
    with open(path, 'w') as f:
        f.write(result)
    return path


def main() -> None:
    start = datetime.now()
    result = ''

    images = to_images('run-melos.pdf', 1, 2)
    for image in images:
        result += to_string(image)
    result = normalize(result)
    path = save(result)

    end = datetime.now()
    duration = end.timestamp() - start.timestamp()

    print('----------------------------------------')
    print(f'Start: {start}')
    print(f'End: {end}')
    print(f'Duration: {int(duration)} seconds')
    print(f'Result file path: {path}')
    print('----------------------------------------')


if __name__ == '__main__':
    main()
```
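As a quick sanity check of the normalize step, the two substitutions can be exercised on a small sample. This is a sketch using only the standard library; note that the character classes cover hiragana あ-ん, katakana ア-ン, and a kanji range, so small kana such as ぁ or ァ fall outside them:

```python
import re


def normalize(target: str) -> str:
    # Remove newlines first, then strip whitespace between Japanese characters.
    result = re.sub(r'\n', '', target)
    result = re.sub(r'([あ-んア-ン一-鿐])\s+(?=[あ-んア-ン一-鿐])', r'\1', result)
    return result


print(normalize('メロス は\n激怒 した。'))  # → メロスは激怒した。
```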
Creating Dockerfile
The run-melos.pdf copied into the image can be downloaded here.
```dockerfile
FROM python:3.10

WORKDIR /usr/src/app
COPY app.py ./
COPY requirements.txt ./
COPY run-melos.pdf ./
RUN apt update && apt install -y poppler-utils tesseract-ocr tesseract-ocr-jpn \
    && pip install -r requirements.txt

CMD ["python", "app.py"]
# CMD ["/bin/sh", "-c", "while :; do sleep 10; done"]
```
Testing
The Python script executes the following:
- Convert the PDF to PNG image data using pdf2image.
- Extract characters from the converted data using Tesseract OCR and pytesseract.
- Save the result in a text file.
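The three steps above can be sketched end-to-end with the conversion and OCR calls stubbed out. Here fake_images and fake_ocr are hypothetical stand-ins for pdf2image.convert_from_path and pytesseract.image_to_string, so the flow can be followed without the binaries installed:

```python
def fake_images(pdf_path: str) -> list:
    # Stand-in for pdf2image.convert_from_path(): one "image" per page.
    return ['page-1', 'page-2']


def fake_ocr(image) -> str:
    # Stand-in for pytesseract.image_to_string(image, lang='jpn').
    return f'text from {image}\n'


def run(pdf_path: str, out_path: str) -> str:
    # Steps 1-2: convert pages, then OCR each one and concatenate.
    result = ''.join(fake_ocr(img) for img in fake_images(pdf_path))
    # Step 3: save the result to a text file.
    with open(out_path, 'w') as f:
        f.write(result)
    return out_path


path = run('run-melos.pdf', 'result.txt')
```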
Run the following command:
```shell
NAME=pytesseract-sample
docker build -t $NAME .
docker run --name $NAME $NAME

# You can see the OCR processing result in `result.txt`.
docker cp $NAME:/usr/src/app/result.txt ./
less result.txt

# Clean up
docker container rm $NAME
docker image rm $NAME
```
Result
The result can be downloaded from: