Launch HN: Parsewise (YC P25) – Reason Across Documents with an API

AI
Developer Tools
Data Infrastructure
Startups

Parsewise launched an API for extracting structured data from large, messy document collections such as PDFs, spreadsheets, emails, and transcripts. The pitch is that it can reason across many files, return schema-compliant output, and attach word-level lineage to every value so a human can verify where each answer came from. The founders framed this as unstructured-data ETL for teams that need more than page-by-page OCR or one-shot prompts to Claude. They said the system is model-agnostic, uses smaller models for exhaustive search and larger models for resolution decisions, and avoids embedding-based retrieval because similarity signals become too weak to separate documents in dense specialist corpora.

If your team is trying to operationalize LLM-based document extraction, the hard part is no longer "can a model read this PDF" but "can we trust, audit, and maintain the output across changing document sets." Treat schema design, review workflow, and citation traceability as core product work, not cleanup after the model call.

July 1, 2026
news.ycombinator.com

Key insights

No universal parquet for documents

Any "document as data" layer has to be shaped around the questions you need answered, not discovered once and reused everywhere. That is why the founder’s reply matters. They are effectively selling a configurable intermediate representation for a workflow, which is a much more opinionated and labor-heavy product than generic document parsing.

If you are building in this space, budget for per-use-case schema and representation design. A generic ingestion layer will get you a demo, but production value comes from the task-specific middle layer.
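
As a rough illustration of what a task-specific middle layer could look like, here is a minimal sketch of a per-use-case schema and a validator for extracted records. The `FieldSpec` class, the `validate` function, and the loan-related field names are all hypothetical and chosen for illustration; none of them come from Parsewise's API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """One field the downstream question needs (hypothetical structure)."""
    name: str
    type_: type
    required: bool = True


def validate(record: dict, schema: list[FieldSpec]) -> list[str]:
    """Return human-readable problems; an empty list means the record conforms."""
    errors = []
    for spec in schema:
        value = record.get(spec.name)
        if value is None:
            if spec.required:
                errors.append(f"missing: {spec.name}")
            continue
        if not isinstance(value, spec.type_):
            errors.append(f"wrong type: {spec.name}")
    return errors


# An assumed schema for one use case (loan review). A different question
# over the same documents would need a different schema.
LOAN_SCHEMA = [
    FieldSpec("counterparty", str),
    FieldSpec("notional", float),
    FieldSpec("maturity_year", int),
    FieldSpec("governing_law", str, required=False),
]

problems = validate({"counterparty": "Acme", "notional": "5m"}, LOAN_SCHEMA)
print(problems)
```

The point of the sketch is that the schema lives next to the use case, so changing the question means changing the schema rather than re-parsing everything.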

Attribution:

whinvik #1
gergelycsegzi #1

Agent definitions become ongoing operations

Portability across domains is weak because the real failure mode is not the first happy-path extraction but the steady arrival of out-of-distribution documents. The founder described customers tightening definitions over time, reviewing feedback, and comparing before-and-after results across existing data. That makes these "agents" look less like prompts and more like living business rules with a QA loop.

Plan ownership before rollout. Someone on the business side needs to review misses, update definitions, and regression-test changes, or accuracy will drift silently.
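
The before-and-after comparison described above can be sketched as a simple diff over two extraction runs on the same documents. This is a minimal sketch under assumptions: the `diff_runs` function and the run format (document ID mapped to a dict of field values) are hypothetical, not the vendor's tooling.

```python
def diff_runs(
    before: dict[str, dict], after: dict[str, dict]
) -> list[tuple[str, str, object, object]]:
    """List every (doc_id, field, old_value, new_value) that changed
    between two extraction runs over the same document set."""
    changes = []
    for doc_id in sorted(before.keys() | after.keys()):
        old = before.get(doc_id, {})
        new = after.get(doc_id, {})
        for field in sorted(old.keys() | new.keys()):
            if old.get(field) != new.get(field):
                changes.append((doc_id, field, old.get(field), new.get(field)))
    return changes


# Outputs from the old and the revised definition (made-up example data).
run_v1 = {"d1": {"rate": "5%"}, "d2": {"rate": "4%"}}
run_v2 = {"d1": {"rate": "5%"}, "d2": {"rate": "4.5%", "term": "5y"}}

for doc_id, field, old, new in diff_runs(run_v1, run_v2):
    print(f"{doc_id}.{field}: {old!r} -> {new!r}")
```

A reviewer then only has to look at the changed values, which keeps definition updates from silently altering results on documents that were already correct.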

Attribution:

chaitralikakde #1
gergelycsegzi #1

The moat is downstream structure and review

The sharpest competitive framing came from a customer-style question and a competitor's response. OCR pricing and page-level extraction are racing toward commodity status, so value shifts to cross-document resolution, typed output, discrepancy handling, and reviewer tooling. Notably, even the founder of a rival product validated the need for explicit schema and workflow integration at scale.

Do not compare vendors on OCR alone. Ask how they handle cross-file conflicts, typed outputs, regression testing, and human review time.
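
To make "cross-file conflicts" concrete, here is a minimal sketch that groups extracted values by entity and field and flags any pair where different files disagree. The `find_conflicts` function and the record layout (`entity`, `field`, `value`, `source`) are assumptions for illustration, not a description of how any vendor resolves discrepancies.

```python
from collections import defaultdict


def find_conflicts(extractions: list[dict]) -> dict:
    """Map (entity, field) -> {value: [sources]} for every pair where
    the source files report more than one distinct value."""
    seen = defaultdict(lambda: defaultdict(list))
    for item in extractions:
        key = (item["entity"], item["field"])
        seen[key][item["value"]].append(item["source"])
    return {key: dict(values) for key, values in seen.items() if len(values) > 1}


# Made-up extractions from three files about the same deal.
rows = [
    {"entity": "Acme", "field": "notional", "value": "5,000,000", "source": "contract.pdf"},
    {"entity": "Acme", "field": "notional", "value": "5,500,000", "source": "email.eml"},
    {"entity": "Acme", "field": "currency", "value": "USD", "source": "contract.pdf"},
    {"entity": "Acme", "field": "currency", "value": "USD", "source": "term_sheet.xlsx"},
]

for (entity, field), values in find_conflicts(rows).items():
    print(entity, field, values)
```

Detection is the easy half. Deciding which source wins, and showing a reviewer why, is where the products in this thread claim to differ.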

Attribution:

gorgmah #1
gergelycsegzi #1 #2
joss82 #1

Embeddings break down in dense specialist corpora

The founders argued that vector similarity performs poorly when nearly every document in a corpus covers the same topic, as in decades of treasury documents where the important signal is a small numeric or categorical variation. Their alternative is exhaustive model-driven search followed by later-stage reasoning. Whether or not that wins universally, it is a useful warning that retrieval quality can collapse even when every document is "relevant."

If your corpus is narrow and repetitive, test retrieval carefully before assuming a standard retrieval-augmented generation stack will work. Small differences in numbers, years, or clauses may need exhaustive search rather than nearest-neighbor lookup.
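
The difference between nearest-neighbor lookup and exhaustive search can be sketched as follows: instead of ranking documents and keeping the top few, every document is checked against a predicate. In this sketch a regular expression stands in for the small-model call, which is an assumption made to keep the example runnable; `exhaustive_search` and `mentions_year_and_rate` are hypothetical names.

```python
import re
from typing import Callable


def exhaustive_search(docs: dict[str, str], is_match: Callable[[str], bool]) -> list[str]:
    """Check every document instead of trusting a top-k similarity ranking."""
    return [doc_id for doc_id, text in docs.items() if is_match(text)]


def mentions_year_and_rate(year: int) -> Callable[[str], bool]:
    """Stand-in for a cheap model call: does the text give a percentage
    figure after mentioning the given year?"""
    pattern = re.compile(rf"\b{year}\b.*?\b\d+\.\d+\s*percent", re.S)
    return lambda text: bool(pattern.search(text))


# Near-duplicate documents that differ only in small details (made-up data).
bulletins = {
    "a": "Treasury bulletin 1987: average yield on 10-year notes was 8.39 percent",
    "b": "Treasury bulletin 1988: average yield on 10-year notes was 8.85 percent",
    "c": "Treasury bulletin 1988: no yield data published",
}

print(exhaustive_search(bulletins, mentions_year_and_rate(1988)))
```

The cost is one check per document, which is the trade-off the founders accept in exchange for not missing the document whose only distinguishing feature is a year or a number.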

Attribution:

dennis16384 #1
gergelycsegzi #1 #2

Traceability matters most where users stay close to source material

The strongest pull outside enterprise automation came from archival and research use cases, where people already stitch together messy pipelines that save time but make provenance painful. Parsewise’s citations and side-by-side sourcing landed because they let researchers or regulated operators keep contact with original material instead of accepting a black-box answer.

If your users must defend or revisit an answer later, provenance is part of the product, not a nice-to-have. Optimize the path from output back to exact source evidence.
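
One way to keep the path from output back to source short is to store a character span with every extracted value. The sketch below assumes plain-text sources and exact string matches. `Citation`, `ExtractedValue`, `cite`, and `evidence` are hypothetical names, and real word-level lineage over PDFs or scans would need page and layout coordinates as well.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    doc_id: str
    start: int  # character offset where the evidence begins
    end: int    # character offset just past the evidence


@dataclass(frozen=True)
class ExtractedValue:
    field: str
    value: str
    citation: Citation


def cite(field: str, value: str, doc_id: str, sources: dict[str, str]) -> ExtractedValue:
    """Attach the first exact occurrence of the value in the source text."""
    start = sources[doc_id].find(value)
    if start < 0:
        raise ValueError(f"{value!r} not found in {doc_id}")
    return ExtractedValue(field, value, Citation(doc_id, start, start + len(value)))


def evidence(ev: ExtractedValue, sources: dict[str, str], context: int = 20) -> str:
    """Return the cited span in brackets with some surrounding text."""
    text = sources[ev.citation.doc_id]
    c = ev.citation
    lo = max(0, c.start - context)
    hi = min(len(text), c.end + context)
    return f"{text[lo:c.start]}[{text[c.start:c.end]}]{text[c.end:hi]}"


sources = {"memo.txt": "The facility matures on 2031-06-30 per clause 4."}
maturity = cite("maturity", "2031-06-30", "memo.txt", sources)
print(evidence(maturity, sources))
```

Refusing to emit a value that cannot be located in the source, as `cite` does here, is one simple guard against unsupported answers.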

Attribution:

vinaigrette #1 #2
gergelycsegzi #1 #2

Against the grain

Single-model prompts may already be enough

For plenty of use cases, the extra machinery is overkill. The founder effectively conceded that point. If Claude can handle your document set and you do not need durable schema, scale, or verification tooling, another orchestration layer just adds cost and complexity.

Start with the simplest workflow that works. Only move to a specialized pipeline when context limits, audit needs, or review burden become the actual blocker.

Attribution:

hnuser #1
gergelycsegzi #1

Large corpora may need indexing first

At e-discovery scale, even the founders backed away from an all-LLM answer and suggested indexing or keyword search because cost and latency take over. That undercuts any notion that exhaustive model-based search is the universal solution. Traditional retrieval infrastructure still has a place.

For very large archives, separate search from extraction. Use indexes or databases to narrow candidates before running expensive reasoning steps.
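
Separating search from extraction can be sketched with a plain inverted index that narrows the candidate set before any expensive step runs. This is a minimal in-memory sketch; `build_index`, `candidates`, and `extract_from_candidates` are hypothetical names, and `expensive_extract` stands in for whatever model call does the reasoning.

```python
import re
from collections import defaultdict
from typing import Callable


def build_index(docs: dict[str, str]) -> dict[str, set[str]]:
    """Map each lowercase token to the IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in set(re.findall(r"\w+", text.lower())):
            index[token].add(doc_id)
    return index


def candidates(index: dict[str, set[str]], keywords: list[str]) -> set[str]:
    """Documents that contain every keyword."""
    sets = [index.get(k.lower(), set()) for k in keywords]
    return set.intersection(*sets) if sets else set()


def extract_from_candidates(
    docs: dict[str, str],
    index: dict[str, set[str]],
    keywords: list[str],
    expensive_extract: Callable[[str], object],
) -> dict[str, object]:
    """Run the costly step only on documents the index lets through."""
    return {d: expensive_extract(docs[d]) for d in sorted(candidates(index, keywords))}


# Made-up archive; len() stands in for the expensive reasoning call.
archive = {
    "e1": "Please review the indemnity clause before signing.",
    "e2": "Lunch on Friday?",
    "e3": "The indemnity clause was removed in draft 3.",
}
idx = build_index(archive)
print(extract_from_candidates(archive, idx, ["indemnity", "clause"], len))
```

At real e-discovery scale the index would live in a search engine or database rather than in memory, but the division of labor is the same.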

Attribution:

vmandrade #1
gergelycsegzi #1
dennis16384 #1

In plain English

e-discovery

Electronic discovery, the process of finding and reviewing digital documents for legal or investigative work.

embedding

A numeric representation of text or other data used to measure similarity or support search and machine learning tasks.

ETL

Extract, transform, load, the process of pulling data from sources, cleaning or reshaping it, and loading it into a destination system.

lineage

The record of where a data value came from and how it was derived.

LLM

Large Language Model, an AI model trained on large amounts of text and used for chatbots, coding tools, and agents.

OCR

Optical character recognition, software that turns text in scans or images into machine-readable text.

Parquet

A columnar data file format designed for efficient storage and querying of structured data.

schema

A defined structure for data, such as the expected fields, types, and rules in a JSON object or table.

Reference links

Product and docs

Parsewise API site
Main product page linked by the founders in the launch post and comments.
Parsewise schema-driven extract docs
Documentation showing the fixed JSON schema workflow discussed in competitive comparison.
Parsewise document processing pipelines writeup
Technical writeup the founders cited to explain how they handle varied document types and extraction pipelines.

Benchmarks and demos

Parsewise OfficeQA benchmark page
Benchmark results page used to support the claim about cross-document reasoning over 90k pages.
Parsewise intermediate representation demo
Example of the configurable intermediate layer and traceable extracted data discussed in multiple replies.

Videos

Parsewise use case video
Founders’ overview video showing product use cases.
Demo UI explanation clip
Clip linked by the founder to explain that the demo UI was intentionally quickly built for integration demos.

Competing or related tools

Struktur
Open source document extraction scaffolding mentioned by a commenter who built a similar tool.
