HN Debrief

Unlimited OCR: One-shot long-horizon parsing

  • AI
  • Open Source
  • Developer Tools
  • Education

Baidu posted code and a paper for an OCR system that tries to solve a very specific bottleneck in modern vision-language OCR. When these models transcribe long PDFs, they keep a growing memory of everything they have already generated. That memory, the KV cache, eats VRAM and slows decoding, so production systems usually fall back to crude chunking by page or crop. Unlimited OCR replaces that with a reference sliding window setup. The model keeps full access to the original document image while only attending to a short recent slice of its own output. The pitch is simple: keep global visual context, stop hoarding text history, and make long-horizon parsing feasible on smaller hardware.
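
To make the memory argument concrete, here is a minimal Python sketch of the idea as summarized above: each generated token can see every image token but only the last few tokens of the model's own output. The function names, the sequence layout (image tokens first, then generated text), and the model shape (32 layers, 8 KV heads, head size 128, 16-bit values) are illustrative assumptions, not details taken from Baidu's code or paper.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes needed to cache keys and values for `num_tokens` tokens.

    The default model shape is a made-up example, not the real model's.
    """
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * num_tokens


def visible_positions(step, num_image_tokens, window):
    """Positions the text token generated at `step` (0-based) may attend to.

    Assumed layout: [image tokens][generated text tokens]. The token sees
    all image tokens plus the last `window` text tokens, itself included.
    """
    image = list(range(num_image_tokens))
    first_text = num_image_tokens + max(0, step - window + 1)
    last_text = num_image_tokens + step
    return image + list(range(first_text, last_text + 1))


def cached_tokens(step, num_image_tokens, window=None):
    """Tokens held in the cache at `step`; window=None means full history."""
    text = step + 1 if window is None else min(step + 1, window)
    return num_image_tokens + text


if __name__ == "__main__":
    image_tokens, window = 1024, 256  # hypothetical sizes
    for step in (100, 10_000, 100_000):
        full = kv_cache_bytes(cached_tokens(step, image_tokens))
        windowed = kv_cache_bytes(cached_tokens(step, image_tokens, window))
        print(step, full, windowed)
```

Under these assumed numbers the full-history cache keeps growing with every generated token, while the windowed cache stops growing once the output is longer than the window.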

If you process long, messy documents today by page-splitting and stitching, this is worth testing because it targets exactly that engineering pain. It also signals where document AI is heading: less brittle page-by-page OCR, more streaming parsers that preserve cross-page context without requiring giant GPUs.

Discussion mood

Interested and cautiously positive. People liked the architectural idea and the fact that Baidu open-sourced it, but the mood stayed practical because OCR users have been burned by hallucinations, layout errors, and tool-specific failure modes too many times to trust headline claims without direct benchmarks and production testing.

Key insights

  1. Layout is still the real bottleneck

    What breaks production OCR is not usually isolated character recognition. It is reconstructing the reading order and structure of ugly real documents with columns, headers, forms, ads, tables, and mixed scripts. Vision-language models help because they can use broader context, but that same dependence on context makes page-by-page pipelines brittle. This architecture is interesting because it attacks that structural problem directly rather than just trying to raise per-character accuracy.

    Evaluate OCR systems on document reconstruction, not just text accuracy. If your inputs have complex layout, prioritize tools that preserve cross-region and cross-page context over ones that only score well on clean page crops.

      Attribution:
    • wongarsu #1
    • joss82 #1
    • sscaryterry #1
  2. Chunking works until document structure fights back

    People already get decent results by slicing images into overlapping crops and stitching the text back together. That is a useful baseline, not a straw man. The catch is that it relies on predictable line heights and regular layout. Skewed scans and dense label-value documents are where local crops lose the context needed to correct errors, so the manual overlap tricks stop being enough.

    Keep your crop-and-stitch pipeline for uniform documents because it is cheap and proven. For noisy scans or structured business documents, test whether global-image context reduces the post-processing heuristics you need to maintain.

      Attribution:
    • freefaler #1
    • MattRogish #1 #2
  3. Early mixed-language result looks promising

    One commenter who ran the model on an RTX 4090 reported that it handled a long Japanese grammar PDF mixing English with kanji and hiragana, and that it preserved the original languages instead of silently translating them. That matters because a common failure mode in generative OCR is to normalize or paraphrase text rather than transcribe it faithfully.

    If fidelity matters, include mixed-language and non-English pages in your eval set. Check not just accuracy but whether the model preserves script and wording exactly instead of "helpfully" rewriting it.

      Attribution:
    • peterderivaz #1
  4. Traditional versus VLM OCR depends on script

    The blanket claim that classic OCR is more reliable did not survive contact with multilingual use cases. Printed English documents are a sweet spot for systems like PaddleOCR. Once you move into CJK, Arabic, Vietnamese, Thai, or messy historical material, vision-language models often recover text that older OCR pipelines miss because they handle script variation and context better.

    Do not generalize from English business documents to your whole corpus. Build separate benchmarks by script and document type before deciding whether a classic OCR stack or a vision-language model is the better default.

      Attribution:
    • __rito__ #1
    • chpatrick #1
    • Oras #1
    • j16sdiz #1
  5. The market still lacks a trusted default

    Even experienced users do not agree on a best OCR stack. The named options span Marker, Mistral OCR, Mathpix, Docling, Azure Document Intelligence, AWS Textract, PaddleOCR, and newer parsers like poma-ai. The pattern is not that one tool dominates. It is that each gets to roughly usable quality, then fails differently on the last stretch. That is why an open release with a new decoding architecture draws attention even in a crowded field.

    Treat OCR vendor selection as a workload-specific decision, not a solved procurement problem. Keep a benchmark harness and run periodic bake-offs, because new models can be materially better on your documents without being universally better.

      Attribution:
    • aliljet #1
    • ljouhet #1
    • badlibrarian #1
    • ai_fry_ur_brain #1
    • gettingoverit #1
    • arboles #1
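
Several of the takeaways above come down to keeping a small benchmark harness that is split by script and document type. Here is a minimal sketch in Python, assuming you already have hand-verified reference transcriptions for a sample of pages. The group labels and function names are hypothetical. Character error rate is also only one signal: it picks up reading-order mistakes only indirectly, so layout-heavy corpora need a structural check on top.

```python
from collections import defaultdict


def edit_distance(a, b):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete from a
                            curr[j - 1] + 1,      # insert into a
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer_by_group(samples):
    """Character error rate per group.

    `samples` is an iterable of (group, reference, hypothesis) tuples, where
    `group` is any hashable label such as ("cjk", "invoice").
    """
    errors = defaultdict(int)
    chars = defaultdict(int)
    for group, reference, hypothesis in samples:
        errors[group] += edit_distance(reference, hypothesis)
        chars[group] += len(reference)
    return {g: errors[g] / chars[g] for g in errors if chars[g] > 0}


if __name__ == "__main__":
    samples = [
        ("latin", "Total due: 42.00", "Total due: 42.00"),
        ("cjk", "日本語", "日本话"),
    ]
    for group, cer in sorted(cer_by_group(samples).items()):
        print(group, round(cer, 3))
```

Running the same sample set through each candidate tool and comparing the per-group numbers shows whether a model that wins on printed English also wins on the rest of the corpus.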

Against the grain

  1. For many use cases OCR is already good enough

    The skeptical view is that this solves a problem most teams do not have. If your job is extracting text from ordinary documents, current vision models are already consistent and stable enough, so rebuilding the OCR engine looks like research churn rather than a practical advance.

    Before adopting a new OCR architecture, quantify whether long-document memory limits are actually your bottleneck. If your current stack meets accuracy and cost targets, this may be interesting but unnecessary.

      Attribution:
    • Oras #1
  2. Context-aware OCR can guess too much

    Generative OCR gets dangerous when it stops transcribing and starts inferring. The examples given were foreign words being translated, names being normalized to more common spellings, and abbreviations being expanded incorrectly. For archives, legal records, and any workflow that cares about the literal source text, better language priors can quietly make the output less trustworthy.

    If exact transcription matters, require confidence signaling and spot checks on ambiguous tokens. Do not reward a model for plausible text if your downstream use depends on preserving the source verbatim.

      Attribution:
    • pbhjpbhj #1
    • pmarreck #1
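
The spot checks suggested above can be partly automated. The Python sketch below compares OCR output against a hand-verified reference for a sample page and surfaces two warning signs: tokens that were changed (normalized names, expanded abbreviations) and a shift in the mix of scripts (a hint of silent translation). The function names are hypothetical, and the script label is a rough heuristic that takes the first word of each character's Unicode name.

```python
import difflib
import unicodedata
from collections import Counter


def script_profile(text):
    """Count letters by rough script label, e.g. LATIN, CJK, HIRAGANA."""
    counts = Counter()
    for ch in text:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts


def changed_tokens(reference, hypothesis):
    """Return (reference_span, hypothesis_span) pairs where tokens differ."""
    ref_tokens = reference.split()
    hyp_tokens = hypothesis.split()
    matcher = difflib.SequenceMatcher(a=ref_tokens, b=hyp_tokens,
                                      autojunk=False)
    return [
        (" ".join(ref_tokens[i1:i2]), " ".join(hyp_tokens[j1:j2]))
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    reference = "Dr. Jon Smyth 食べる"
    hypothesis = "Dr. John Smith to eat"
    print(changed_tokens(reference, hypothesis))
    print(script_profile(reference))
    print(script_profile(hypothesis))
```

Whitespace tokenization does not split scripts written without spaces, so for CJK text a character-level diff may be the better choice.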

In plain English

4090
NVIDIA GeForce RTX 4090, a high-end consumer graphics card often used to run AI models locally.
CJK
Chinese, Japanese, and Korean, often grouped together because their writing systems pose similar OCR challenges.
FineReader
ABBYY FineReader, a long-standing commercial OCR product for document scanning and conversion.
KV cache
Key-value cache, the stored internal attention state a transformer keeps so it can generate long sequences more efficiently, at the cost of growing memory use.
OCR
Optical Character Recognition, the process of turning text in images or scanned documents into machine-readable text.
olmOCR
A benchmark and toolset referenced in the comments for evaluating OCR systems on document understanding tasks.
PDF
Portable Document Format, a common file format for digital documents that preserves layout.
VRAM
Video random-access memory, the memory on a GPU used for graphics and AI workloads.

Reference links

Music notation and OMR resources

  • MEI Music Encoding Initiative
    Suggested as the format used by musicologists and researchers for rich symbolic music representation.
  • Verovio engraver
    Named as a renderer that can preserve score metadata and help generate training data for optical music recognition.
  • svg2pl script
    Shared as a script to convert engraved scores into a COCO-style detection dataset.
  • TROMPA COCO dataset
    Linked as a synthetic music score dataset for training or testing music detection models.
  • MuseScore MEI support docs
    Mentioned to show that MEI support is being added to MuseScore.
  • Harte chord notation paper
    Referenced as a precise symbolic notation for chord analysis datasets.
  • ChoCo dataset
    Shared as a research dataset using Harte notation for chord analysis.
  • When in Rome dataset
    Named as another music analysis dataset, though not a full engraving format.
  • ABC notation
    Raised as a simpler text format for notated music and discussed as a possible model-friendly representation.
  • ABC notation history
    Used to clarify that ABC originated in European folk music rather than church music.