Adaptive PDFs

AI
Accessibility
Security
Developer Tools

The post demonstrates an "adaptive PDF" that uses PDF replacement text and related structure so that a human sees a normally designed document while text extraction tools pull out cleaner, more structured content such as Markdown. The visible PDF does not change by reader. What differs is the machine-readable layer, which need not match what is visually rendered.

The discussion quickly moved away from the demo itself and toward the long-standing mess around PDFs. Commenters noted that the format has supported embedded structure, attachments, JavaScript, and accessibility tags for years, but most authoring tools still emit files that look fine and extract badly. A few corrected the post's claim that LaTeX cannot do tagged PDF: modern LaTeX tooling can, and public-sector accessibility rules already require semantic tagging in many cases. The sharper takeaway was that LLM use did not create this problem. It turned a niche accessibility and document-engineering issue into an operational one for anyone feeding PDFs into automated systems.

If your product ingests PDFs with AI or automation, do not trust extracted text as ground truth. Build pipelines that validate extractor behavior, prefer tagged or accessibility-friendly PDFs when you control generation, and assume prompt injection or hidden-text tricks will show up in real workflows.

June 12, 2026
sgaud.com

Key insights

PDFs already demand hostile-input handling

The useful frame is not "AI makes PDFs dangerous" but "PDFs have always been dangerous and underspecified in practice." Hidden text, scrambled extraction, JavaScript, and parser bugs have been part of real document workflows for years, which is why some teams already rasterize incoming PDFs and run OCR instead of trusting embedded text or active features.

If you process third-party PDFs, treat them like untrusted executables rather than inert documents. Consider flattening, rasterizing, or sandboxing before extraction, especially in automated intake flows.
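As a rough illustration of hostile-input triage, the sketch below counts a few name tokens in the raw bytes of a PDF that commonly signal scripts, auto-run actions, or embedded files. The token list and the function names are assumptions made for this example. The scan does not look inside compressed object streams, so a clean result proves nothing; it is a cheap first filter that could sit in front of rasterizing or sandboxing, not a replacement for them.

```python
import re

# Name tokens that often indicate active or embedded content in a PDF.
# This list is an illustrative assumption, not a complete inventory.
RISKY_NAMES = (
    b"/JavaScript",
    b"/JS",
    b"/OpenAction",
    b"/AA",
    b"/Launch",
    b"/EmbeddedFile",
)


def decode_name_escapes(data: bytes) -> bytes:
    """Undo #xx hex escapes, which PDF allows inside name tokens."""
    return re.sub(
        rb"#([0-9A-Fa-f]{2})",
        lambda m: bytes([int(m.group(1), 16)]),
        data,
    )


def scan_pdf_bytes(data: bytes) -> dict[str, int]:
    """Count occurrences of risky name tokens in raw PDF bytes."""
    normalized = decode_name_escapes(data)
    counts: dict[str, int] = {}
    for name in RISKY_NAMES:
        # Require a non-alphanumeric character after the token so that
        # /JS does not match a longer name that merely starts with /JS.
        pattern = re.escape(name) + rb"(?![A-Za-z0-9])"
        hits = len(re.findall(pattern, normalized))
        if hits:
            counts[name.decode("ascii")] = hits
    return counts
```

An intake flow might route any file with a non-empty result to the stricter path (flatten, rasterize, OCR) and still sandbox everything else.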

Attribution:

projektfu #1
dmlittle #1
UltraSane #1

Accessibility tagging is the missing infrastructure

Tagged PDF is not a new invention waiting for AI. It is the accessibility layer that screen readers and regulated publishers already depend on. Modern LaTeX can produce it, and public-sector guidance already spells out how to use it. The novelty here is commercial pressure. LLM parsing may finally force organizations to care about semantic structure they should have been shipping anyway.

When you generate PDFs, invest in tagged output instead of custom extraction hacks. The same semantic markup improves accessibility compliance today and machine readability tomorrow.
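To check whether a generator is emitting tagged output at all, a crude heuristic is to look for the two markers a tagged PDF normally declares in its catalog: a structure tree root and a mark-info entry with Marked set to true. The sketch below assumes those markers appear uncompressed in the file, which is not guaranteed, so a negative result means "not detected" rather than "untagged". The function name is hypothetical, and a real accessibility checker does far more than this.

```python
import re


def looks_tagged(data: bytes) -> bool:
    """Heuristic: True if raw PDF bytes show a structure tree and Marked true.

    Assumes the document catalog is stored uncompressed. Catalogs kept in
    compressed object streams will be missed (a false negative).
    """
    has_struct_tree = b"/StructTreeRoot" in data
    is_marked = re.search(rb"/Marked\s+true", data) is not None
    return has_struct_tree and is_marked
```

A check like this fits a build pipeline as a smoke test: it will not tell you the tags are good, only that the tagging step did not silently switch off.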

Attribution:

Tomte #1
kccqzy #1
al_hag #1

This changes extraction, not what readers see

The core mechanism affects text extraction paths, not visual rendering. That sounds subtle, but it is the whole story. The same person can view one thing and extract another from the same file. That makes the idea useful for structured export, but only when your chosen libraries actually honor replacement text. OCR and some extractors will ignore it and quietly fall back to the messy version.
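For readers unfamiliar with replacement text, the sketch below shows roughly what it looks like at the content-stream level, assuming the mechanism in question is the /ActualText property on a marked-content span as defined in the PDF specification. The helper names are hypothetical, only ASCII replacement text is handled, and the demo's actual implementation may differ.

```python
def pdf_literal_string(text: str) -> bytes:
    """Encode ASCII text as a PDF literal string, escaping special characters."""
    raw = text.encode("ascii")  # raises UnicodeEncodeError for non-ASCII input
    # Backslashes must be escaped first so later escapes are not doubled.
    raw = raw.replace(b"\\", b"\\\\")
    raw = raw.replace(b"(", b"\\(").replace(b")", b"\\)")
    raw = raw.replace(b"\r", b"\\r").replace(b"\n", b"\\n")
    return b"(" + raw + b")"


def wrap_with_actual_text(drawing_ops: bytes, replacement: str) -> bytes:
    """Wrap content-stream operators in a span that carries /ActualText.

    Viewers still paint whatever drawing_ops draws. Extractors that honor
    /ActualText are expected to return the replacement string instead.
    """
    return (
        b"/Span << /ActualText "
        + pdf_literal_string(replacement)
        + b" >> BDC\n"
        + drawing_ops
        + b"\nEMC"
    )
```

For example, wrapping the operators that draw a styled heading with the replacement "# Overview" leaves the page looking the same while offering a Markdown heading to any extractor that respects the property.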

Test your exact extractor stack against representative PDFs before you rely on this technique. Do not assume another library, browser, or OCR step will preserve the machine-readable layer.
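One way to run that test is to push the same file through every extractor in your stack and measure how much the outputs agree. The sketch below is a minimal harness under stated assumptions: each extractor is a callable you supply that takes PDF bytes and returns text, and the names used here are hypothetical. It uses only the Python standard library and does not call any real PDF library.

```python
from difflib import SequenceMatcher
from typing import Callable, Dict, Tuple

# An extractor is any callable that turns PDF bytes into text.
Extractor = Callable[[bytes], str]


def normalize(text: str) -> str:
    """Collapse whitespace and case so layout noise does not dominate."""
    return " ".join(text.split()).lower()


def extractor_agreement(
    pdf_bytes: bytes, extractors: Dict[str, Extractor]
) -> Dict[Tuple[str, str], float]:
    """Return a similarity ratio (0.0 to 1.0) for each pair of extractors."""
    outputs = {name: normalize(fn(pdf_bytes)) for name, fn in extractors.items()}
    names = sorted(outputs)
    scores: Dict[Tuple[str, str], float] = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            matcher = SequenceMatcher(None, outputs[first], outputs[second])
            scores[(first, second)] = matcher.ratio()
    return scores
```

A low score between an embedded-text extractor and an OCR pass is exactly the signal to investigate: it may mean replacement text was honored by one path and ignored by the other, or that the file carries text a human never sees.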

Attribution:

gpvos #1
Xotic007 #1
SarthakGaud #1

Use standard attachments instead of PDF-ZIP tricks

Bundling source files by making one file valid as both PDF and ZIP is clever, but it leans on parser tolerance and breaks as soon as someone resaves the PDF. PDF already has a standard attachment feature, and tools like pdftk can embed source files directly without asking recipients to rename extensions or know a hidden trick.

If you want reproducible documents with source attached, use PDF attachments or embedded files from the spec. Save format hacks for one-off experiments, not documents that will travel through real editing and compliance workflows.
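As a sketch of the standard route, the helper below builds a pdftk command line that uses its attach_files operation to embed source files in a PDF. It assumes pdftk is installed and on PATH. The function names and file names are made up for illustration.

```python
import subprocess
from typing import List, Sequence


def build_attach_command(
    pdf_in: str, attachments: Sequence[str], pdf_out: str
) -> List[str]:
    """Build the argv for: pdftk IN attach_files FILES... output OUT."""
    if not attachments:
        raise ValueError("at least one attachment is required")
    return ["pdftk", pdf_in, "attach_files", *attachments, "output", pdf_out]


def attach_sources(pdf_in: str, attachments: Sequence[str], pdf_out: str) -> None:
    """Run pdftk to write a copy of pdf_in with the files embedded."""
    # check=True raises CalledProcessError if pdftk exits with an error.
    subprocess.run(build_attach_command(pdf_in, attachments, pdf_out), check=True)
```

Recipients can then pull the files back out with pdftk's unpack_files operation or through the attachments panel in viewers that support it, with no renaming required.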

Attribution:

bad_username #1
da_chicken #1
cjs_ac #1

LLM editing now leaves a trust tax

Several readers assumed the article itself was AI-written because of its tone and presentation, even though the author said they wrote it and used an LLM only to polish the English. That reaction matters because it shows how quickly machine polish now changes credibility. For non-native speakers, even light editing can erase the human cues readers use to judge authenticity.

If you publish technical writing, keep your own voice even when using editing tools. Heavy rewrite passes can make solid work look synthetic and lower trust before readers engage with the substance.

Attribution:

remywang #1
SarthakGaud #1
dang #1
ugoasidjg #1

Against the grain

Humans always needed machine-readable PDFs too

The AI framing can make this sound like machine-readable structure only matters now that LLMs ingest documents. That is backward. Poor text extraction and inaccessible PDFs have been harming users for decades. Screen readers, search, reuse, and plain ownership of your own documents all depended on better structure long before LLM pipelines showed up.

Do not justify document improvements only through AI readiness. Accessibility, searchability, and reuse remain strong enough reasons to fix your PDF output on their own.

Attribution:

fsckboy #1

HTML and print CSS are often the better answer

Instead of teaching more people to exploit obscure corners of the PDF spec, one commenter argued for making HTML the primary structured document and generating PDF as a print artifact. HTML5 sectioning, accessibility markup, and locale-aware presentation solve many of the same problems in a more familiar stack, even if most sites still fail to use them well.

If you control document creation end to end, compare a web-native publishing pipeline against deeper PDF investment. You may get better accessibility and machine readability by treating PDF as export, not source of truth.

Attribution:

Theodores #1

In plain English

JavaScript

The scripting language built into web browsers and widely used for web applications.

LaTeX

A document preparation system widely used for writing mathematics and scientific papers with precise formatting.

LLM

Large Language Model, a machine learning system trained to generate and analyze text.

Markdown

A lightweight plain-text formatting style that can be converted into nicely formatted documents or web pages.

OCR

Optical character recognition, software that turns scanned images of text into machine-readable text.

pdftk

PDF Toolkit, a command-line tool for combining, splitting, and modifying PDF files.

Tagged PDF

A PDF that includes semantic structure such as headings, paragraphs, and lists so software can understand the document layout and meaning.

Reference links

Accessibility and tagged PDF guidance

Tagged PDF best practice guide
Cited as practical guidance for semantic PDF structure and accessibility tagging.
Section 508 PDF tags and usage guidance
Used to support the claim that publicly funded US organizations are expected to produce semantically tagged PDFs.
LaTeX tagging project status
Provided to correct the claim that LaTeX cannot produce tagged PDFs.
Overleaf introduction to tagged PDF files
Background reading on how tagged PDF works in LaTeX and why accessibility support is hard.

Research on document and AI attacks

Prompt injection via resumes paper
Shared as evidence that hiding model-targeted instructions in resumes has already been studied academically.
Paper on accessibility failures in academic publishing
Used to show that many academic PDFs still fail basic accessibility and structured-access standards.

Project code and related references

adaptivepdf GitHub repository
Repository readers identified as the code behind the blog post demo.
Hacker News search for "own voice" comments
Linked to expand on the moderation advice about preserving authentic writing voice when using LLMs for editing.
