HN Debrief

Too many R packages: CRAN is inundated with submissions

The post argues that CRAN is overcrowded with R packages, many of them narrow, redundant, or weakly maintained, which makes discovery and review harder. CRAN is the central package repository for R, and that matters because R is heavily used in research and statistics by people who often care more about getting an analysis done than about software engineering. The result is a package culture unlike that of mainstream developer ecosystems. Packages are not just code libraries: they can bundle datasets, wrap a single paper's method, or ship one-off tooling for a lab.

If you rely on R, assume package discovery will keep getting harder and put more weight on curated internal standards than on raw CRAN search. If you run an ecosystem or marketplace, CRAN shows that human gatekeeping still preserves trust, but also that such gatekeeping does not scale without better filtering and clearer norms about what belongs in the main index.

Discussion mood

Mostly resigned and mildly negative. People broadly accepted that package overload is real and getting worse with AI, but many defended CRAN as one of the few major package ecosystems that still enforces enough human review to remain trustworthy.

Key insights

  1. CRAN still buys trust with strict review

    CRAN stays usable because it behaves less like an open dump and more like a curated distribution. Human reviewers enforce policies, reject fragile packages, and even remove packages that break ecosystem norms. That friction is painful for authors, but it gives users something rare now: packages that usually install cleanly, handle dependencies sanely, and are less likely to trash the environment.

    Treat a CRAN listing as a stronger trust signal than a package on most language registries, but expect that model to bottleneck as submissions grow. If you maintain internal tooling, mirror CRAN’s policy posture by setting hard rules on packaging, dependency behavior, and error handling before you scale contribution volume.

      Attribution:
    • hadley #1
    • tylermw #1 #2
    • nxobject #1
    • cscheid #1
    • jochapjo #1
  2. GitHub should absorb the long tail

    A lot of the submission pressure looks avoidable because many packages are not meant for broad public reuse in the first place. For lab tools, team-specific helpers, and niche code with tiny audiences, a GitHub repo already gives easy distribution without consuming central review capacity. The argument here is not that small packages are bad. It is that the main registry works better when it represents software intended to be maintained, documented, and discoverable by strangers.

    If a package is only for your team or one research group, do not default to the central registry. Reserve CRAN for packages you are prepared to document, support, and keep compatible over time.

      Attribution:
    • parsimo2010 #1
    • MostlyStable #1
  3. Research code and product code optimize for different things

    The recurring complaint about bad R packages lands differently once you accept that many authors are not trying to build reusable software. In research settings, code often exists to answer one question once, under deadline, by people trained in statistics or domain science rather than software engineering. That explains both the rough edges and the value. It also explains why engineers routinely underestimate how much methodological expertise matters, while researchers underestimate the cost they impose on the next person who has to run their code.

    When hiring or evaluating data teams, separate statistical competence from software quality instead of expecting one to imply the other. Add lightweight engineering guardrails around research code early, because someone will eventually need to rerun it after the original author is gone.

      Attribution:
    • mr_toad #1
    • ngriffiths #1
    • mjhay #1
    • buellerbueller #1
    • malshe #1
    • asdff #1
  4. LLMs remove package friction, not analysis risk

    For academic users, the biggest win from LLMs is brutally practical. They can unblock compilation, find obscure dependencies, and stitch together old code that previously took days or weeks to revive. But that same convenience can deepen the real failure mode, which is weak judgment about preprocessing choices, method selection, and whether a package is the right tool at all. Faster package use does not mean better analysis.

    Use LLMs to accelerate environment setup and code archaeology, then force human review on method choice and data handling. If you teach or manage analysts, grade their decisions and assumptions, not just whether the script runs.

      Attribution:
    • colechristensen #1 #2
    • nxobject #1
    • freehorse #1
  5. Discovery is now a ranking problem

    Package count by itself is less damaging than bad search and weak filtering. What people need is not a bigger index. They need ways to search by intent and then filter by maintenance status, tests, documentation quality, recent activity, and other signals that distinguish a serious library from a throwaway one. Without that, rewriting from scratch can feel cheaper than searching.

    Build your own package shortlist and quality filters instead of relying on repository browse pages. If you run a platform, invest in ranking and metadata quality before adding more submission throughput.

      Attribution:
    • frogperson #1
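
The shortlist-and-filter advice in the discovery insight above can be sketched in a few lines of Python. This is a minimal illustration, not something proposed in the discussion: the `PackageInfo` fields, the weights, and the 0.5 cutoff are all hypothetical placeholders, and the sample packages are made up. It shows one way to fold signals such as recent activity, tests, documentation, and adoption into a single score and keep only the packages that clear a threshold.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PackageInfo:
    """Hypothetical metadata record; field names are illustrative, not a CRAN schema."""
    name: str
    last_release: date
    has_tests: bool
    has_vignettes: bool
    reverse_dependencies: int


def quality_score(pkg: PackageInfo, today: date) -> float:
    """Combine a few maintenance signals into a rough score between 0 and 1.

    The weights are arbitrary placeholders; tune them to your own standards.
    """
    days_since_release = (today - pkg.last_release).days
    # Recency decays linearly to zero over roughly three years.
    recency = max(0.0, 1.0 - days_since_release / (3 * 365))
    # Cap the adoption signal so one very popular package does not dominate.
    adoption = min(pkg.reverse_dependencies, 20) / 20
    return (
        0.4 * recency
        + 0.2 * (1.0 if pkg.has_tests else 0.0)
        + 0.2 * (1.0 if pkg.has_vignettes else 0.0)
        + 0.2 * adoption
    )


def shortlist(packages, today, min_score=0.5):
    """Return the names of packages scoring at least min_score, best first."""
    scored = [(quality_score(p, today), p) for p in packages]
    kept = [(s, p) for s, p in scored if s >= min_score]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [p.name for _, p in kept]


if __name__ == "__main__":
    # Made-up sample data for illustration only.
    today = date(2025, 1, 1)
    candidates = [
        PackageInfo("activelib", date(2024, 11, 1), True, True, 35),
        PackageInfo("labhelper", date(2019, 3, 1), False, False, 0),
        PackageInfo("steadytool", date(2023, 6, 1), True, False, 4),
    ]
    print(shortlist(candidates, today))
```

A weighted sum is the simplest choice here. A team could just as well use hard cutoffs (for example, reject anything without tests) or pull the inputs from whatever metadata source it trusts.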

Against the grain

  1. CRAN contains lots of non-software packages

    One commenter argued that CRAN should not be judged like a normal software package registry at all because many packages are really delivery vehicles for datasets or tiny wrappers around a narrow analysis. That makes the ecosystem look stranger and lower quality to engineers than it does to its intended users. Some of the apparent bloat is a mismatch between what software people expect a package repository to contain and what R users actually publish there.

    If you compare package ecosystems across languages, normalize for what a “package” is being used to ship. Policy and tooling for R should account for data distribution and research artifacts, not just reusable libraries.

      Attribution:
    • dizhn #1
  2. Tidyverse may be the least bad on-ramp

    The anti-tidyverse case got a lot of attention, but several comments made the opposite point more persuasively. Base R is widely seen as inconsistent and full of legacy quirks, while tidyverse gives many users a more uniform mental model and syntax. For newcomers, especially those not trained as programmers, that coherence may reduce package chaos rather than worsen it because it narrows the set of patterns they have to learn.

    Do not assume “fewer packages” automatically means lower complexity for new users. If you are standardizing an R stack for a team or course, consistency of conventions may matter more than ideological purity about base R.

      Attribution:
    • mjhay #1
    • nswizzle31 #1
    • dash2 #1

In plain English

AI
Artificial intelligence: software systems that perform tasks such as prediction, generation, or decision-making that usually require human intelligence.
base R
The core R language and standard packages that come with a normal R installation.
CRAN
Comprehensive R Archive Network, the main public repository for R packages.
GitHub
A popular online platform for hosting code repositories and collaborating on software development.
npm
Node Package Manager, the main package ecosystem for JavaScript and Node.js.
PyPI
Python Package Index, the main public repository for Python packages.
R
A programming language and computing environment widely used for statistics, data analysis, and research.
tidyverse
A popular collection of R packages that share a common style for data manipulation, visualization, and analysis.

Reference links

CRAN policies and package rules

  • Writing R Extensions
    Cited as an example of CRAN’s extensive package rules and review requirements.
  • CRAN Repository Policy
    Linked to show specific policy requirements such as restrictions on monkeypatching and expectations for robust error handling.

R language internals and evaluation model

  • Advanced R: Quasiquotation
    Shared to explain the theory behind tidyeval and rebut the claim that tidyverse evaluation semantics are ad hoc or bizarre.