HN Debrief

HackerRank open sourced its ATS. My resume scored 90/100. Oh wait 74. No – 88

  • AI
  • Hiring
  • Regulation
  • Machine Learning

The post examined HackerRank’s open-source hiring tool, the ATS (applicant tracking system) of the title, which uses multiple LLM calls to extract resume data and assign a score out of 100 plus bonuses. Running the same resume repeatedly produced materially different results (90, 74, and 88 in the author’s runs), enough to move a candidate above or below an arbitrary cutoff. That made the author’s core point easy to grasp even for non-ML readers: if the same input can swing from pass to fail, the score is not a stable measurement. Commenters dug into the mechanics, but the useful conclusion was broader than temperature settings or sampler details. LLM scoring is noisy, and making the output deterministic would not fix the underlying issue: the rubric itself is thin, subjective, and badly aligned with actual hiring quality.

If you use LLMs to rank or score candidates, treat the output as noisy triage at best, not a decision. Audit the rubric first, measure variance across repeated runs, and assume legal and fairness risk if public artifacts like GitHub activity stand in for job quality.
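
Measuring variance across repeated runs can be sketched in a few lines. This is a minimal illustration, not the tool's own code: `score_stability`, `make_replay_scorer`, and the cutoff of 80 are hypothetical names and values, and the replay scorer simply cycles through the three scores from the post's title in place of a real LLM call.

```python
import itertools
import statistics
from typing import Callable, Dict, Iterable


def score_stability(
    score_fn: Callable[[str], float],
    resume: str,
    runs: int = 20,
    cutoff: float = 80.0,  # hypothetical cutoff, not taken from the tool
) -> Dict[str, object]:
    """Score the same resume `runs` times and summarize the spread."""
    if runs < 2:
        raise ValueError("need at least two runs to measure spread")
    scores = [score_fn(resume) for _ in range(runs)]
    passes = sum(1 for s in scores if s >= cutoff)
    return {
        "mean": statistics.mean(scores),
        "stdev": statistics.stdev(scores),  # sample standard deviation
        "min": min(scores),
        "max": max(scores),
        "pass_rate": passes / runs,
        # True when the same input lands on both sides of the cutoff
        "flips": 0 < passes < runs,
    }


def make_replay_scorer(scores: Iterable[float]) -> Callable[[str], float]:
    """Stand-in for a real LLM scorer: replays a fixed sequence of scores."""
    replay = itertools.cycle(scores)

    def scorer(resume: str) -> float:
        return float(next(replay))

    return scorer


if __name__ == "__main__":
    scorer = make_replay_scorer([90, 74, 88])
    print(score_stability(scorer, "resume text", runs=3))
```

A resume whose `flips` value is true is one where the pass or fail outcome depends on which run happened to be recorded.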

Discussion mood

Strongly negative and uneasy. People were frustrated by the score variability, the absurd weighting toward open source and side projects, and the broader normalization of opaque AI triage in hiring. The few pragmatic defenses came from recruiters and hiring managers drowning in application volume, but even they mostly framed it as a grim compromise, not a good system.

Key insights

  1. Hiring risk is legal as much as technical

    Using a noisy LLM ranker in hiring creates more than an accuracy problem. It also creates a discoverable record of any systematic bias. Several comments pointed to GDPR Article 22 in Europe and disparate impact doctrine in the US. A model that overweights GitHub activity, school signals, names, or other proxies does not need to explicitly ingest race or gender to become a litigation magnet. The Workday case in California came up as evidence that courts are already willing to entertain these claims.

    If AI touches hiring decisions, involve counsel and compliance before product or recruiting teams ship it. Keep audit trails, validate for disparate impact, and assume “we only used it as guidance” will not save you if it changes who gets seen.

      Attribution:
    • dathinab #1
    • jerf #1
    • oceansweep #1
    • tikhonj #1
    • buzer #1 #2
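
One common first pass at the disparate impact validation mentioned in this item is the four-fifths rule of thumb from US employment guidance: compare each group's selection rate with the highest group's rate and flag ratios below 0.8. The discussion did not name this check, so treat it as an assumption. The sketch below uses hypothetical function names and group labels, and it is a screening heuristic rather than the full validation counsel would ask for.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def selection_rates(outcomes: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Map each group label to the fraction of its applicants who advanced."""
    totals: Counter = Counter()
    advanced: Counter = Counter()
    for group, did_advance in outcomes:
        totals[group] += 1
        if did_advance:
            advanced[group] += 1
    return {group: advanced[group] / totals[group] for group in totals}


def impact_ratios(rates: Dict[str, float]) -> Dict[str, float]:
    """Divide each group's rate by the highest rate (1.0 means parity)."""
    best = max(rates.values())
    if best == 0:
        # Nobody advanced in any group, so there is no disparity to measure.
        return {group: 1.0 for group in rates}
    return {group: rate / best for group, rate in rates.items()}


def flagged_groups(rates: Dict[str, float], threshold: float = 0.8) -> List[str]:
    """Groups whose impact ratio falls below the four-fifths threshold."""
    ratios = impact_ratios(rates)
    return sorted(group for group, ratio in ratios.items() if ratio < threshold)


if __name__ == "__main__":
    # Hypothetical outcomes: (group label, advanced past the screen?)
    data = [("A", True)] * 5 + [("A", False)] * 5
    data += [("B", True)] * 3 + [("B", False)] * 7
    rates = selection_rates(data)
    print(rates, flagged_groups(rates))
```
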
  2. Public GitHub is a class-biased proxy

    The scoring rubric does not just reward engineering skill. It rewards the ability to build and publish in public. That excludes people whose best work is locked behind employer confidentiality, regulated industries, client work, internal tooling, or life constraints outside work. It also favors candidates with time, safety, and cultural familiarity to cultivate an online technical persona. That turns “open source contributions” into a lifestyle filter dressed up as merit.

    Do not let public artifacts dominate candidate evaluation unless the role truly requires public community work. For most roles, ask for evidence that can come from private work samples, scoped exercises, or structured interviews instead.

      Attribution:
    • CM30 #1
    • thewebguyd #1
    • webpraktikos #1
    • Arch-TK #1
    • yobid20 #1
  3. The prompt design is worse than the model

    The repo’s scoring prompt leaves huge gaps for judgment. It bundles extraction, interpretation, and grading into one pass, then asks for point spreads without tight anchors. Criteria like “substantial community involvement” or “live demo bonus” are underspecified, and forbidden attributes are left in the input instead of removed before scoring. Several comments argued that even a stronger model would still wobble because the rubric is not operationalized enough to produce repeatable measurements.

    Before swapping models, rewrite the task. Separate extraction from evaluation, convert fuzzy ranges into explicit checks, strip irrelevant fields before inference, and validate each sub-step against human-labeled examples.

      Attribution:
    • YossarianFrPrez #1
    • ludicrousdispla #1
    • pu_pe #1
    • Madmallard #1
    • pmarreck #1
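
The rewrite suggested in this item (strip irrelevant fields, then replace fuzzy point ranges with explicit checks) might look roughly like the sketch below. It assumes extraction has already produced a dictionary in a separate step. Every field name, threshold, and point value here is hypothetical and chosen only to show the shape of the approach, not to recommend these criteria.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]

# Hypothetical list of fields removed before any scoring step sees the record.
STRIP_BEFORE_SCORING = {"name", "email", "phone", "photo_url", "date_of_birth"}


def strip_fields(record: Record) -> Record:
    """Return a copy of the extracted record without the stripped fields."""
    return {k: v for k, v in record.items() if k not in STRIP_BEFORE_SCORING}


def has_live_demo(record: Record) -> bool:
    """Explicit stand-in for a vague 'live demo bonus'."""
    return any(p.get("demo_url") for p in record.get("projects", []))


def has_merged_external_prs(record: Record) -> bool:
    """Explicit stand-in for 'substantial community involvement'."""
    return record.get("merged_external_prs", 0) >= 3


# Each check is (label, points, predicate); points are illustrative only.
CHECKS: List[Tuple[str, int, Callable[[Record], bool]]] = [
    ("live demo linked from a project", 5, has_live_demo),
    ("3+ merged PRs to repos the candidate does not own", 10, has_merged_external_prs),
]


def evaluate(record: Record) -> Tuple[int, List[str]]:
    """Score a record deterministically and report which checks fired."""
    clean = strip_fields(record)
    fired = [(label, points) for label, points, check in CHECKS if check(clean)]
    return sum(points for _, points in fired), [label for label, _ in fired]


if __name__ == "__main__":
    extracted = {
        "name": "Example Candidate",
        "projects": [{"title": "demo app", "demo_url": "https://example.com"}],
        "merged_external_prs": 1,
    }
    print(evaluate(extracted))
```

Because each check either fires or does not, the same extracted record always yields the same score, and each sub-step can be validated separately against human-labeled examples.
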
  4. Application volume is the force driving this

    The strongest defense of these tools was not that they work well. It was that teams are buried under hundreds or thousands of applicants and need any way to reduce the pile. That pushed the conversation toward process design rather than model quality. People proposed shorter application windows, batch-based review, stopping after enough strong candidates, and work-sample-first funnels. The common thread was that companies are treating infinite inbound volume as fixed when it is partly a pipeline design choice.

    If hiring volume is overwhelming, redesign intake before adding AI scoring. Limit the top of funnel, collect better signals earlier, and avoid creating a system where low-quality automation is compensating for a broken application process.

      Attribution:
    • jerrythegerbil #1 #2
    • RugnirViking #1
    • Xirdus #1
    • kasey_junk #1
    • conductr #1
  5. Real resumes exposed concrete hallucinations

    People who ran the tool on their own CVs did not just see score drift. They saw specific factual errors. The model missed known GitHub work, invented Google Summer of Code participation, ignored certifications and awards, and still penalized accomplished candidates for lacking the exact public signals it wanted. One commenter even ran Andrew Ng’s CV through it and got a score low enough to read as a fail. That makes the problem less abstract than “LLMs are stochastic.” The tool also extracts facts unreliably.

    Test these systems on gold-standard resumes with known attributes before using them operationally. If extraction is wrong, downstream ranking is noise with a spreadsheet attached.

      Attribution:
    • joshmn #1
    • fernandopj #1 #2
    • robertlagrant #1
    • kdavis #1
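
Testing extraction against gold-standard resumes, as this item recommends, can start with a plain set comparison per field. The sketch below is an assumption about how such a test might be structured: `compare_extraction` and the field names are hypothetical, and the sample data is invented to mirror the kinds of errors commenters reported (a missed certification, an invented program).

```python
from typing import Dict, Set

Fields = Dict[str, Set[str]]


def compare_extraction(gold: Fields, extracted: Fields) -> Dict[str, Dict[str, Set[str]]]:
    """Report, per field, items the extractor missed and items it invented.

    Fields that match the hand-labeled gold record exactly are omitted.
    """
    report: Dict[str, Dict[str, Set[str]]] = {}
    for field in gold.keys() | extracted.keys():
        expected = gold.get(field, set())
        found = extracted.get(field, set())
        missed = expected - found      # present in the resume, not extracted
        invented = found - expected    # extracted, but not in the resume
        if missed or invented:
            report[field] = {"missed": missed, "invented": invented}
    return report


if __name__ == "__main__":
    # Hypothetical hand-labeled record and extractor output for one resume.
    gold = {
        "certifications": {"Example Certification"},
        "programs": set(),
        "github_repos": {"example-repo"},
    }
    extracted = {
        "certifications": set(),
        "programs": {"Google Summer of Code"},
        "github_repos": {"example-repo"},
    }
    for field, diff in sorted(compare_extraction(gold, extracted).items()):
        print(field, diff)
```

An empty report across a labeled set is the minimum bar before any downstream score is worth inspecting.
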
  6. Resume scoring is the wrong object

    A few comments argued that the deeper mistake is trying to infer job fit from resumes at all. One company described replacing resumes with open-ended questionnaire answers tied to company values and role-specific prompts, then using AI only to help sort that richer signal. Others argued for easy work samples or comparative judgments between candidates rather than absolute scores. The useful shift here is away from “make resume parsing better” and toward collecting evidence that is actually predictive.

    If you are redesigning hiring, start by changing the input, not the scorer. Structured writing samples, job-specific questions, and lightweight work tests are more defensible than numeric judgments over a resume blob.

      Attribution:
    • a4isms #1
    • sp2hari #1
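
The comparative-judgment idea raised in this item can be illustrated with a simple win-rate ranking over pairwise decisions. This is one possible reading of the suggestion, not something a commenter specified: `rank_by_wins` and the candidate labels are hypothetical, and the comparisons could come from human reviewers as easily as from a model.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def rank_by_wins(comparisons: List[Tuple[str, str]]) -> List[Tuple[str, float]]:
    """Rank candidates by the share of pairwise comparisons they won.

    Each comparison is (winner, loser). Ties in win rate are broken
    alphabetically so the ordering is reproducible.
    """
    wins: Dict[str, int] = defaultdict(int)
    games: Dict[str, int] = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        games[winner] += 1
        games[loser] += 1
    rates = {candidate: wins[candidate] / games[candidate] for candidate in games}
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))


if __name__ == "__main__":
    # Hypothetical judgments on work samples from three candidates.
    judgments = [("a", "b"), ("a", "c"), ("b", "c")]
    print(rank_by_wins(judgments))
```

The output is an ordering rather than an absolute score, so no arbitrary cutoff is implied.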

Against the grain

  1. The tool was used for ranking, not hard rejection

    The HackerRank CTO said this repo was a local demo setup using a small model, not the production configuration. He also said the system was meant to sort tens of thousands of intern resumes so humans could read the strongest first, with only very low scores ignored and most applications still reviewed manually. That does not rescue the design, but it does narrow the claim that this exact code was automatically rejecting candidates at scale.

    When evaluating hiring automation, separate demo repos from production use and ask exactly where the model sits in the funnel. Ranking for reviewer order and autonomous rejection carry very different risk, even if both need scrutiny.

      Attribution:
    • sp2hari #1 #2
  2. Variance can be used as a signal

    A smaller group argued that non-determinism is not always a flaw. If repeated runs produce a wide score distribution, that may reveal the model is uncertain about weak evidence. In that framing, a single score is the bug, while a distribution across many runs is the honest output. That still makes the current implementation bad, but it suggests a better use of LLMs as uncertainty estimators rather than crisp judges.

    If you insist on LLM-based assessment, collect repeated samples and inspect variance instead of trusting one number. High spread should trigger manual review rather than being averaged away into confidence theater.

      Attribution:
    • PaulHoule #1
    • CuriouslyC #1
    • nonethewiser #1
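
Treating spread as a routing signal, as this item proposes, could be as simple as the sketch below. The function name `route_by_spread` and the 10-point spread threshold are hypothetical. The input is a list of scores already collected from repeated runs on one resume.

```python
import statistics
from typing import Dict, Sequence, Union


def route_by_spread(
    scores: Sequence[float],
    max_spread: float = 10.0,  # hypothetical tolerance, tune on real data
) -> Dict[str, Union[str, float]]:
    """Send high-variance candidates to a human instead of averaging them."""
    if len(scores) < 2:
        raise ValueError("need repeated samples to estimate spread")
    spread = max(scores) - min(scores)
    if spread > max_spread:
        # The model disagrees with itself, so no single number is reported.
        return {"route": "manual_review", "spread": spread}
    return {"route": "ranked", "spread": spread, "median": statistics.median(scores)}


if __name__ == "__main__":
    print(route_by_spread([90, 74, 88]))  # the three scores from the post's title
    print(route_by_spread([84, 86, 85]))
```
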
  3. Cold applications were already a black hole

    Some commenters pushed back on treating LLMs as the reason job searches feel hopeless. They argued that cold applying has long been lossy, recruiter summaries were error-prone before current models, and referrals or direct outreach still dominate outcomes. That does not excuse bad automation. It places it in a pipeline that was already opaque and arbitrary.

    Candidates should not overfit entirely to ATS discourse. Networking, referrals, and direct contact still matter more than polishing for a scorer you cannot see.

      Attribution:
    • rsanek #1
    • us-merul #1
    • seanieb #1

In plain English

Article 22
The GDPR article that gives people the right not to be subject to decisions based solely on automated processing when those decisions significantly affect them.
disparate impact
A legal concept where a policy can be unlawful if it disproportionately harms a protected group, even without explicit intent to discriminate.
GDPR
General Data Protection Regulation, a European Union privacy law that includes rights around personal data handling.
GitHub
A widely used platform for hosting code repositories and collaborating on software projects.
LLM
Large language model, an AI system trained on large amounts of text that can generate and transform language and code.
OSS
Open source software, software whose source code is publicly available for others to inspect, use, and modify.
