📷 OCR — Image / PDF to Text

Tesseract 5 with the trained Japanese model. Extract text from PNG / JPEG / WebP / GIF or PDF (max 50 MB, 50 pages per PDF).

100% Free No signup Server-side No logs / DB Rate-limited VPS high-accuracy OSS-based 5 languages

🔒 About Privacy

・Uploaded files are passed to Tesseract then immediately deleted (a few seconds in /tmp at most).
・No logs of OCR text, file names, or sizes are kept.
・Rate limit: 30 requests per IP per minute.

📖 Where people get stuck

Extracts text from images (PNG, JPEG, WebP, GIF) and PDFs using Tesseract 5 with a Japanese trained model. Limits: 50 MB per file, 50 pages per PDF, and 30 requests per minute per IP. Accuracy depends far more on the quality of the input than on the model: the same document produces completely different results when scanned at 300 dpi and when photographed at an angle. When a result is poor, suspect the input first.
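The 300 dpi figure can be checked before uploading. The following is a minimal sketch, assuming you know the physical width of the page and that the page fills the full width of the image; the function names and the A4 default (210 mm) are illustrative choices, not part of this tool.

```python
MM_PER_INCH = 25.4


def effective_dpi(width_px: int, page_width_mm: float = 210.0) -> float:
    """Pixels per inch across the page, assuming the page fills the image width."""
    if width_px <= 0 or page_width_mm <= 0:
        raise ValueError("width_px and page_width_mm must be positive")
    return width_px / (page_width_mm / MM_PER_INCH)


def meets_threshold(width_px: int, page_width_mm: float = 210.0,
                    min_dpi: float = 300.0) -> bool:
    """True when the image has at least min_dpi pixels per inch of page."""
    return effective_dpi(width_px, page_width_mm) >= min_dpi
```

If the page occupies only part of the photo, pass the pixel width of the page region rather than of the whole image; otherwise the estimate will be too optimistic.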

Case	What happens	What to do
The accuracy is lower than you expected	Tesseract is optimised for printed type. Handwriting, low resolution, patterned backgrounds, shadows, skew and creases therefore degrade it sharply. Resolution matters most: below 300dpi the recognition rate visibly deteriorates — a document photographed on a phone is often readable on screen while having too few pixels per character. Skew matters too: even a few degrees can break line segmentation and merge several lines together. Coloured backgrounds and faint rules also get mistaken for glyphs and produce spurious characters.	When scanning, use at least 300dpi, greyscale, and correct the skew. When photographing, the basics are straight on, positioned so no shadow falls across it, and filling the frame with the document. If it is skewed, rotate it straight before running OCR (image rotate tool) — that one step alone can transform the result. Upscaling a low-resolution image does not improve accuracy: information that is not there cannot be added, so you have to retake it. If the accuracy still falls short, raising the contrast and binarising as a preprocessing step helps. As a practical rule, if five pages are not good enough, running fifty will not be either — improve the input first.
Similar-looking characters get swapped	Distinguishing similar shapes is a fundamentally hard problem: digit `1` against lowercase `l` and capital `I`, `0` against `O`, and in Japanese the katakana ro against the kanji for mouth. Humans resolve these from context — which of two near-identical spellings is right follows from the meaning, not from the letterforms. The dangerous case is a string with no context: a part number, a serial, a password or a URL, where no language model correction applies and the error simply stands.	Narrowing the language reduces confusions: for a purely Japanese document, choosing Japanese only removes any scope for confusing it with Latin letters, and for an alphanumeric document, English only does the same. Japanese plus English is convenient, but more candidates means more mistakes. More effective still is validating after the OCR: any string whose format you know can be corrected mechanically with a regular expression — if a part number is three letters and four digits, an `O` appearing in a digit position is definitively a misread `0` (regex tester). And wherever a single wrong character is fatal — amounts, account numbers, part numbers — a person must check against the original. OCR is a tool for reducing typing, not for skipping verification.
Tables and layout come out broken	Tesseract recognises text line by line and nothing at a higher level; it does not reconstruct columns or table structure. Feed it a table and the rules vanish, leaving the cell contents strung out on one line separated by spaces, with nothing in the output to tell you where one cell ended. A two-column layout behaves the same way, with the left and right columns interleaved line by line. Headers, footers, page numbers and marginal notes are not distinguished from the body either and land in the middle of the prose.	If tables are the goal, use a table-extraction tool rather than OCR: Tabula and Camelot infer columns from rules and whitespace. They assume a PDF with a text layer, though, so for a scanned table the practical route is to OCR the characters and rebuild the table by hand. For a multi-column document, cut the page into left and right halves and OCR each half separately; each column then comes out cleanly (image crop tool). Fundamentally, OCR is the wrong tool when you need the layout preserved. Asking whoever produced the document for the source data is often both the most reliable and the fastest route.
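The post-OCR validation described in the table can be sketched as follows. This is a minimal example under stated assumptions: the format (three uppercase letters followed by four digits) is the hypothetical part-number format from the table, and the confusion maps cover only the `O`/`0` and `I`/`l`/`1` swaps mentioned there. The name `repair_part_number` is illustrative.

```python
import re
from typing import Optional

# Hypothetical format: three uppercase letters, then four digits (e.g. "ABC1234").
PART_NUMBER = re.compile(r"[A-Z]{3}[0-9]{4}")

# Confusions are corrected only where the format says which class is expected.
TO_DIGIT = str.maketrans({"O": "0", "o": "0", "I": "1", "l": "1"})
TO_LETTER = str.maketrans({"0": "O", "1": "I"})


def repair_part_number(raw: str) -> Optional[str]:
    """Return the corrected part number, or None if it still fails validation."""
    token = raw.strip()
    if len(token) != 7:
        return None
    # First three positions must be letters, last four must be digits.
    candidate = token[:3].translate(TO_LETTER) + token[3:].translate(TO_DIGIT)
    return candidate if PART_NUMBER.fullmatch(candidate) else None
```

A `None` result means the string cannot be fixed mechanically and needs a person to check it against the original. Where a single wrong character is fatal, a repaired value should be checked by a person as well.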

This is the only tool on this site that sends your file to a server: the Japanese trained model is too large to run in a browser, so the premise differs from every other page here. Files that are sent sit in a temporary directory for a few seconds and are deleted the moment Tesseract finishes, and no log is kept of the recognised text, the filename or the size. Even so, there are situations where sending it anywhere is itself not permitted: contracts, medical records, personnel files and unpublished financial information are routinely covered by organisational rules forbidding transmission to external services. In those cases, install tesseract locally and run it directly (`brew install tesseract tesseract-lang` or `apt install tesseract-ocr tesseract-ocr-jpn`). It is the same engine and the same model as this page, so the results are equivalent. The test is whether you would be allowed to email that file outside your organisation: if not, do not put it in here either.
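Once Tesseract is installed, a local run can be scripted. The sketch below assumes the `tesseract` binary is on your PATH and the `jpn` and `eng` language data are installed; the helper names are illustrative, and the settings this page uses on its server are not documented here, so treat the options as a starting point.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_cmd(image, out_base, langs=("jpn", "eng")):
    """argv for `tesseract IMAGE OUTPUTBASE -l LANGS`; Tesseract writes OUTPUTBASE.txt."""
    if not langs:
        raise ValueError("at least one language code is required")
    return ["tesseract", str(image), str(out_base), "-l", "+".join(langs)]


def run_local_ocr(image, out_base, langs=("jpn", "eng")) -> str:
    """Run Tesseract locally and return the recognised text."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    subprocess.run(build_tesseract_cmd(image, out_base, langs), check=True)
    return Path(f"{out_base}.txt").read_text(encoding="utf-8")
```

Passing `langs=("jpn",)` or `langs=("eng",)` corresponds to the Japanese-only and English-only choices; nothing leaves your machine in either case.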

📖 How to Use

1. Choose file
   Drag & drop an image (PNG / JPEG / WebP / GIF) or PDF (max 50 MB).

2. Pick language
   Pick Japanese + English (recommended), Japanese only, or English only.

3. Run → copy or download
   Click Run OCR. Copy the result or download as .txt.

❓ Frequently Asked Questions

How accurate is it?

It uses Tesseract 5 with the official Japanese trained model. Clean print (books, PDFs, scans) typically reaches 90%+ accuracy; handwriting, complex backgrounds, and low resolution all reduce it.

PDF page limit?

Up to 50 pages per PDF. Ghostscript rasterizes each page to a 300 dpi grayscale PNG, then OCR runs page by page.
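For a local run, the rasterisation step can be approximated with Ghostscript's `pnggray` device. This is a sketch only: the exact command this page runs on its server is not published, and the helper name, the output pattern and the page-range parameters are assumptions for illustration.

```python
def build_gs_cmd(pdf, out_pattern="page-%03d.png", first_page=1,
                 last_page=None, dpi=300):
    """argv that renders PDF pages to grayscale PNG files, one file per page."""
    cmd = [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=pnggray",          # 8-bit grayscale PNG output
        f"-r{dpi}",                  # rendering resolution in dpi
        f"-dFirstPage={first_page}",
    ]
    if last_page is not None:
        cmd.append(f"-dLastPage={last_page}")
    # %03d in the pattern is expanded by Ghostscript to the page number.
    cmd += [f"-sOutputFile={out_pattern}", str(pdf)]
    return cmd
```

Run the returned list with `subprocess.run(cmd, check=True)`, then pass each PNG to Tesseract in turn.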

Are uploaded files stored?

No. Files live in a temp directory for a few seconds and are deleted right after Tesseract finishes. No logs of OCR text, file names, or sizes are kept.
