Extracting Manga Text For JPDB Using Mokuro

Step 1: Get raw images of Manga somehow

Step 2: Use Mokuro

Mokuro is a python OCR program for manga: Mokuro

  • (Optional but not really) Create a virtualenv for python 3.9, python 3.10 is not supported yet

    • virtualenv -p python3 venv
    • source venv/bin/activate or source venv/Scripts/activate if on Windows
  • (Optional) Install PyTorch with CUDA: PyTorch

    • Note: The command is wrong for pip at the moment, use this:
    • pip3 install torch==1.12.1+cu116 torchvision==0.13.1+cu116 torchaudio==0.12.1+cu116 --extra-index-url https://download.pytorch.org/whl/cu116
  • Install Mokuro pip3 install mokuro

  • Run Mokuro (Check Documentation for more info)

    • mokuro --parent-dir "path/to/parent/dir"
    • Note: OCR takes a while, the better your computer is, the faster it’ll be
  • Congratulations, you now have a Yomichan-able html file

    • Note: For Yomichan, turn on the setting: Allow access to file URLs, and refresh the page

Step 3: Extract Text From JSON to use in JPDB

  • Use this dumb script, or something
import glob
import json

output_filename = 'output.txt'
directory = r"path\to\directory\_ocr"

files = glob.glob(f"{directory}/**/*.json", recursive=False)
with open(output_filename,'w',encoding='utf-8') as out_file:
	for file in files:
		print(file)
		with open(file, 'r',encoding='utf-8') as f:
			input = json.load(f)
			for block in input["blocks"]:
				for line in block["lines"]:
					out_file.write(line + '\n')

Paste that shit from output.txt in JPDB