HN Debrief

Adaptive PDFs

  • AI
  • Accessibility
  • Security
  • Developer Tools

The post demonstrates an "adaptive PDF" that uses PDF replacement text and related structure so that a human sees a normally designed document while text extraction tools pull out cleaner, more structured content such as Markdown. The visible PDF does not change by reader. What differs is the machine-readable layer, which need not match what is visually rendered.

The discussion quickly moved past the demo itself and toward the long-standing mess around PDFs. Commenters noted that the format has supported embedded structure, attachments, JavaScript, and accessibility tags for years, yet most authoring tools still emit files that look fine and extract badly. A few corrected the post's claim that LaTeX cannot produce tagged PDF: modern LaTeX tooling can, and public-sector accessibility rules already require semantic tagging in many cases. The sharper takeaway was that LLM use did not create this problem. It turned a niche accessibility and document-engineering issue into an operational one for anyone feeding PDFs into automated systems.

If your product ingests PDFs with AI or automation, do not trust extracted text as ground truth. Build pipelines that validate extractor behavior, prefer tagged or accessibility-friendly PDFs when you control generation, and assume prompt injection or hidden-text tricks will show up in real workflows.

Discussion mood

Interested but wary. People liked the idea and thought it exposed a real gap in PDF tooling, but the mood quickly turned skeptical about extractor compatibility, hidden-text abuse, and the broader sloppiness of treating PDFs as reliable machine input.

Key insights

  1. PDFs already demand hostile-input handling

    The useful frame is not "AI makes PDFs dangerous" but "PDFs have always been dangerous and underspecified in practice." Hidden text, scrambled extraction, JavaScript, and parser bugs have been part of real document workflows for years, which is why some teams already rasterize incoming PDFs and run OCR instead of trusting embedded text or active features.

    If you process third-party PDFs, treat them like untrusted executables rather than inert documents. Consider flattening, rasterizing, or sandboxing before extraction, especially in automated intake flows.

      Attribution:
    • projektfu #1
    • dmlittle #1
    • UltraSane #1
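
As a rough illustration of the "untrusted input" stance above, an intake flow could flag PDFs whose raw bytes mention active or embedded content before any extractor touches them. This is a minimal sketch, not a security control: the function name and the marker list are assumptions made for illustration, and the scan only sees names that appear literally in the file. Names hidden inside compressed object streams or written with hex escapes will be missed, so a clean result should never be read as "safe".

```python
import re
from typing import Set

# PDF name tokens that commonly signal scripts, auto-run actions, or embedded
# files. The list is illustrative, not exhaustive.
SUSPICIOUS_NAMES = (
    b"/JavaScript",
    b"/JS",
    b"/OpenAction",
    b"/AA",
    b"/Launch",
    b"/EmbeddedFile",
)


def find_active_markers(pdf_bytes: bytes) -> Set[str]:
    """Return the suspicious name tokens that appear literally in the file."""
    found = set()
    for name in SUSPICIOUS_NAMES:
        # Require a delimiter (or end of data) after the name so that "/JS"
        # does not match a longer name such as "/JSomething".
        pattern = re.escape(name) + rb"(?=[\s/<\[(>\]]|$)"
        if re.search(pattern, pdf_bytes):
            found.add(name.decode("ascii"))
    return found
```

A file that comes back with any markers could be routed to the stricter path the commenters describe (sandboxing, or rasterizing and running OCR). A file that comes back empty still deserves normal hostile-input handling.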
  2. Accessibility tagging is the missing infrastructure

    Tagged PDF is not a new invention waiting for AI. It is the accessibility layer that screen readers and regulated publishers already depend on. Modern LaTeX can produce it, and public-sector guidance already spells out how to use it. The novelty here is commercial pressure. LLM parsing may finally force organizations to care about semantic structure they should have been shipping anyway.

    When you generate PDFs, invest in tagged output instead of custom extraction hacks. The same semantic markup improves accessibility compliance today and machine readability tomorrow.

      Attribution:
    • Tomte #1
    • kccqzy #1
    • al_hag #1
  3. This changes extraction, not what readers see

    The core mechanism affects text extraction paths, not visual rendering. That sounds subtle, but it is the whole story: the same person can view one thing and extract another from the same file. That makes the idea useful for structured export, but only when your chosen libraries actually honor replacement text. OCR reads only the rendered pixels, and some extractors ignore replacement text, so both quietly return the messy version.

    Test your exact extractor stack against representative PDFs before you rely on this technique. Do not assume another library, browser, or OCR step will preserve the machine-readable layer.

      Attribution:
    • gpvos #1
    • Xotic007 #1
    • SarthakGaud #1
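
One way to act on the "test your exact extractor stack" advice is a small regression harness that runs each extractor over a representative PDF and scores its output against the text you expect the machine-readable layer to yield. The sketch below assumes you wrap each real library in a plain function that takes the PDF bytes and returns a string. The function names and the idea of a similarity score are illustrative choices, not something the post or the thread prescribes.

```python
import difflib
import re
from typing import Callable, Dict


def normalize(text: str) -> str:
    """Collapse whitespace runs so line-wrapping differences do not dominate."""
    return re.sub(r"\s+", " ", text).strip()


def score_extractors(
    pdf_bytes: bytes,
    expected: str,
    extractors: Dict[str, Callable[[bytes], str]],
) -> Dict[str, float]:
    """Score each extractor's output against the expected text (1.0 = identical)."""
    want = normalize(expected)
    scores = {}
    for name, extract in extractors.items():
        got = normalize(extract(pdf_bytes))
        scores[name] = difflib.SequenceMatcher(None, want, got).ratio()
    return scores
```

In use, `extractors` would map a label such as "library A" or "OCR path" to a wrapper around that tool, and a CI check could fail when any score drops below a threshold you pick for your documents. An extractor that ignores replacement text should show up as a visibly lower score on files that depend on it.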
  4. Use standard attachments instead of PDF-ZIP tricks

    Bundling source files by making one file valid as both PDF and ZIP is clever, but it leans on parser tolerance and breaks as soon as someone resaves the PDF. PDF already has a standard attachment feature, and tools like pdftk can embed source files directly without asking recipients to rename extensions or know a hidden trick.

    If you want reproducible documents with source attached, use PDF attachments or embedded files from the spec. Save format hacks for one-off experiments, not documents that will travel through real editing and compliance workflows.

      Attribution:
    • bad_username #1
    • da_chicken #1
    • cjs_ac #1
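
For the attachment route mentioned above, a thin wrapper around pdftk might look like the following. It is a sketch under assumptions: pdftk must be installed and on the PATH, the file names in the usage note are hypothetical, and the helper names are made up for this example. The only pdftk features it relies on are the `attach_files` operation and the `output` keyword.

```python
import subprocess
from pathlib import Path
from typing import List, Sequence


def pdftk_attach_command(
    pdf: Path, attachments: Sequence[Path], output: Path
) -> List[str]:
    """Build the pdftk argument list that embeds files as PDF attachments."""
    if not attachments:
        raise ValueError("at least one attachment is required")
    return [
        "pdftk",
        str(pdf),
        "attach_files",
        *(str(path) for path in attachments),
        "output",
        str(output),
    ]


def attach_sources(pdf: Path, attachments: Sequence[Path], output: Path) -> None:
    """Run pdftk; raises CalledProcessError if pdftk exits with an error."""
    subprocess.run(pdftk_attach_command(pdf, attachments, output), check=True)
```

Calling `attach_sources(Path("report.pdf"), [Path("report.tex")], Path("report-with-sources.pdf"))` would write a new PDF carrying the source file as a standard attachment, which recipients can pull back out with pdftk's `unpack_files` operation or a viewer's attachment panel.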
  5. LLM editing now leaves a trust tax

    Several readers assumed the article itself was AI-written because of its tone and presentation, even though the author said they wrote it and used an LLM only to polish the English. That reaction matters because it shows how quickly machine polish now changes credibility. For non-native speakers, even light editing can erase the human cues readers use to judge authenticity.

    If you publish technical writing, keep your own voice even when using editing tools. Heavy rewrite passes can make solid work look synthetic and lower trust before readers engage with the substance.

      Attribution:
    • remywang #1
    • SarthakGaud #1
    • dang #1
    • ugoasidjg #1

Against the grain

  1. Humans always needed machine-readable PDFs too

    The AI framing can make this sound like machine-readable structure only matters now that LLMs ingest documents. That is backward. Poor text extraction and inaccessible PDFs have been harming users for decades. Screen readers, search, reuse, and plain ownership of your own documents all depended on better structure long before LLM pipelines showed up.

    Do not justify document improvements only through AI readiness. Accessibility, searchability, and reuse remain strong enough reasons to fix your PDF output on their own.

      Attribution:
    • fsckboy #1
  2. HTML and print CSS are often the better answer

    Instead of teaching more people to exploit obscure corners of the PDF spec, one commenter argued for making HTML the primary structured document and generating PDF as a print artifact. HTML5 sectioning, accessibility markup, and locale-aware presentation solve many of the same problems in a more familiar stack, even if most sites still fail to use them well.

    If you control document creation end to end, compare a web-native publishing pipeline against deeper PDF investment. You may get better accessibility and machine readability by treating PDF as export, not source of truth.

      Attribution:
    • Theodores #1

In plain English

JavaScript
The main programming language used to add interactivity and application logic in web browsers. PDF files can also embed it as scripts that some viewers will run.
LaTeX
A document preparation system widely used for technical and academic writing, especially when precise typesetting is needed.
LLM
Large language model, a type of AI system trained on large amounts of text to generate and analyze language.
Markdown
A lightweight plain-text formatting syntax used to write structured documents with headings, lists, links, and emphasis.
OCR
Optical Character Recognition, software that converts text in an image or scanned page into machine-readable text.
pdftk
PDF Toolkit, a command-line tool for combining, splitting, and modifying PDF files.
Tagged PDF
A PDF that includes semantic structure such as headings, paragraphs, and lists so software can understand the document's logical structure and reading order.