HackerRank open sourced its ATS. My resume scored 90/100. Oh wait 74. No – 88

AI
Hiring
Regulation
Machine Learning

The post examined HackerRank’s open-source hiring tool, which uses multiple LLM calls to extract resume data and assign a score out of 100 plus bonuses. Running the same resume repeatedly produced materially different results, enough to move a candidate above or below an arbitrary cutoff. That made the author’s core point easy to grasp even for non-ML readers: if the same input can swing from pass to fail, the score is not a stable measurement. People dug into the mechanics, but the useful conclusion was broader than temperature settings or sampler details. LLM scoring is noisy, and making the output deterministic would not fix the underlying issue: the rubric itself is thin, subjective, and badly aligned with actual hiring quality.

The sharpest comments landed on two failures in the design. First, the prompt asks the model to make too many fuzzy judgments at once, with vague point ranges and poorly defined criteria like “significant contributions” or “architectural complexity.” That invites variance even before model randomness enters. Second, the scoring weights are bizarre for experienced engineers. Open source work and personal projects dominate the score, while years of real work count for much less. That systematically favors people with time and freedom to build in public, and punishes candidates in defense, consulting, private enterprise, caregiving, or simply anyone who does not spend evenings maintaining a public GitHub. Several people noted that this is exactly how you manufacture proxy discrimination while pretending to be objective.

From there the discussion converged on a pragmatic view of what these systems are actually doing. In high-volume hiring, many teams are not searching for an ideal evaluator. They are searching for any cheap way to shrink a pile of applications. That makes a mediocre filter tempting even when it is barely better than random, because the real alternative is often rushed humans, arbitrary ordering effects, or a pile nobody reads. But commenters were blunt that this is still a choice, not a law of nature. You can cap application windows, sample in waves, use structured work samples, compare candidates pairwise, or ask domain-specific questions instead of pretending a resume parser can infer engineering ability from OSS stars and blog links.

Late in the thread, the HackerRank CTO clarified that this repo was a demo configuration, not their production setup, and said it was built to rank tens of thousands of intern resumes rather than fully automate decisions. That softened one narrow claim about how this exact code is used. It did not change the main takeaway. Once you let an LLM produce scores inside a hiring pipeline, people will trust the number more than they should, and tiny implementation choices become policy. The comments treated that as the real scandal: not that one open-source repo is sloppy, but that the industry keeps wrapping uncertain model judgments in numeric authority and calling it process.

If you use LLMs to rank or score candidates, treat the output as noisy triage at best, not a decision. Audit the rubric first, measure variance across repeated runs, and assume legal and fairness risk if public artifacts like GitHub activity stand in for job quality.
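
Measuring variance across repeated runs can be sketched in a few lines. This is a minimal illustration, not the HackerRank code: `audit_variance`, `make_replay_scorer`, and the cutoff of 80 are hypothetical, and the stand-in scorer simply replays the three scores from the post's title. In practice `score_fn` would wrap whatever LLM scoring call is under audit.

```python
import itertools
import statistics
from typing import Callable, Iterable


def audit_variance(score_fn: Callable[[str], float], resume: str,
                   runs: int = 20, cutoff: float = 80.0) -> dict:
    """Score the same resume repeatedly and summarize how unstable the result is."""
    if runs < 2:
        raise ValueError("need at least two runs to measure spread")
    scores = [score_fn(resume) for _ in range(runs)]
    passes = sum(s >= cutoff for s in scores)
    return {
        "min": min(scores),
        "max": max(scores),
        "stdev": statistics.stdev(scores),
        "pass_rate": passes / runs,
        # True when identical input lands on both sides of the cutoff.
        "flips": 0 < passes < runs,
    }


def make_replay_scorer(scores: Iterable[float]) -> Callable[[str], float]:
    """Hypothetical stand-in for an LLM scorer: replays fixed scores in a loop."""
    cycle = itertools.cycle(scores)

    def score(_resume: str) -> float:
        return next(cycle)

    return score


report = audit_variance(make_replay_scorer([90.0, 74.0, 88.0]), "resume text")
print(report)
```

If `flips` is true for a meaningful share of resumes, the cutoff is acting on noise rather than on the candidate.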

June 29, 2026
danunparsed.com

Discussion mood

Strongly negative and uneasy. People were frustrated by the score variability, the absurd weighting toward open source and side projects, and the broader normalization of opaque AI triage in hiring. The few pragmatic defenses came from recruiters and hiring managers drowning in application volume, but even they mostly framed it as a grim compromise, not a good system.

Key insights

Hiring risk is legal as much as technical

Using a noisy LLM ranker in hiring creates more than an accuracy problem. It creates discoverable evidence of systematic bias. Several comments pointed to GDPR Article 22 in Europe and disparate impact doctrine in the US. A model that overweights GitHub, school signals, names, or other proxies does not need to explicitly ingest race or gender to become a litigation magnet. The Workday case in California came up as proof that courts are already willing to entertain these claims.

If AI touches hiring decisions, involve counsel and compliance before product or recruiting teams ship it. Keep audit trails, validate for disparate impact, and assume “we only used it as guidance” will not save you if it changes who gets seen.
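
One common first-pass check for disparate impact is the four-fifths rule of thumb from US employee-selection guidance: compare each group's selection rate with the highest group's rate and flag ratios below 0.8. The thread did not prescribe this specific check, so treat it as an assumption about where validation might start. The function names, group labels, and counts below are made up, and passing this check is not a legal safe harbor.

```python
def selection_rates(outcomes: dict[str, tuple[int, int]]) -> dict[str, float]:
    """outcomes maps group -> (advanced, applied). Groups with no applicants are skipped."""
    return {g: adv / applied for g, (adv, applied) in outcomes.items() if applied > 0}


def adverse_impact_ratios(outcomes: dict[str, tuple[int, int]]) -> dict[str, float]:
    """Each group's selection rate divided by the highest group's selection rate."""
    rates = selection_rates(outcomes)
    top = max(rates.values(), default=0.0)
    if top == 0:
        # Nobody advanced in any group, so no group's rate differs from another's.
        return {g: 1.0 for g in rates}
    return {g: r / top for g, r in rates.items()}


def flagged_groups(outcomes: dict[str, tuple[int, int]], threshold: float = 0.8) -> list[str]:
    """Groups whose ratio falls below the four-fifths threshold."""
    ratios = adverse_impact_ratios(outcomes)
    return sorted(g for g, ratio in ratios.items() if ratio < threshold)


# Illustrative counts only: (advanced past the screen, applied).
example = {"group_a": (50, 100), "group_b": (30, 100)}
print(adverse_impact_ratios(example))
print(flagged_groups(example))
```

The same calculation should be run on who the reviewers actually saw, not only on final offers, since ranking changes exposure before any decision is recorded.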

Attribution:

dathinab #1
jerf #1
oceansweep #1
tikhonj #1
buzer #1 #2

Public GitHub is a class-biased proxy

The scoring rubric does not just reward engineering skill. It rewards the ability to build and publish in public. That excludes people whose best work is locked behind employer confidentiality, regulated industries, client work, internal tooling, or life constraints outside work. It also favors candidates with time, safety, and cultural familiarity to cultivate an online technical persona. That turns “open source contributions” into a lifestyle filter dressed up as merit.

Do not let public artifacts dominate candidate evaluation unless the role truly requires public community work. For most roles, ask for evidence that can come from private work samples, scoped exercises, or structured interviews instead.

Attribution:

CM30 #1
thewebguyd #1
webpraktikos #1
Arch-TK #1
yobid20 #1

The prompt design is worse than the model

The repo’s scoring prompt leaves huge gaps for judgment. It bundles extraction, interpretation, and grading into one pass, then asks for point spreads without tight anchors. Criteria like “substantial community involvement” or “live demo bonus” are underspecified, and forbidden attributes are left in the input instead of removed before scoring. Several comments argued that even a stronger model would still wobble because the rubric is not operationalized enough to produce repeatable measurements.

Before swapping models, rewrite the task. Separate extraction from evaluation, convert fuzzy ranges into explicit checks, strip irrelevant fields before inference, and validate each sub-step against human-labeled examples.
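
A rough sketch of that separation, under stated assumptions: extraction (the LLM step) is assumed to have already produced a plain dictionary of facts, and everything after it is ordinary deterministic code. The field names, the forbidden-field list, and the point values are hypothetical placeholders, not the repo's rubric and not a recommended weighting.

```python
from dataclasses import dataclass

# Hypothetical fields that should never reach the scoring step.
FORBIDDEN_FIELDS = {"name", "age", "gender", "photo_url", "nationality"}
# Hypothetical role requirement used by one of the checks.
REQUIRED_LANGUAGES = {"python", "go"}


@dataclass(frozen=True)
class Check:
    label: str
    points: int
    passed: bool


def strip_forbidden(extracted: dict) -> dict:
    """Drop irrelevant or protected fields before any evaluation happens."""
    return {k: v for k, v in extracted.items() if k not in FORBIDDEN_FIELDS}


def evaluate(facts: dict) -> list[Check]:
    """Explicit yes/no checks over extracted facts. No model call happens here."""
    languages = {lang.lower() for lang in facts.get("languages", [])}
    return [
        Check("3+ years professional experience", 30,
              facts.get("years_experience", 0) >= 3),
        Check("lists a required language", 20,
              bool(languages & REQUIRED_LANGUAGES)),
        Check("work sample submitted", 50,
              facts.get("work_sample_submitted", False) is True),
    ]


def total(checks: list[Check]) -> int:
    return sum(c.points for c in checks if c.passed)


# Pretend this dict came back from the extraction step.
extracted = {"name": "A. Candidate", "years_experience": 5,
             "languages": ["Python"], "work_sample_submitted": False}
facts = strip_forbidden(extracted)
checks = evaluate(facts)
print(total(checks), [c.label for c in checks if c.passed])
```

Because each check is a named boolean, a reviewer can see why a total came out the way it did, and the extraction step can be validated on its own against human-labeled examples.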

Attribution:

YossarianFrPrez #1
ludicrousdispla #1
pu_pe #1
Madmallard #1
pmarreck #1

Application volume is the force driving this

The strongest defense of these tools was not that they work well. It was that teams are buried under hundreds or thousands of applicants and need any way to reduce the pile. That pushed the conversation toward process design rather than model quality. People proposed shorter application windows, batch-based review, stopping after enough strong candidates, and work-sample-first funnels. The common thread was that companies are treating infinite inbound volume as fixed when it is partly a pipeline design choice.

If hiring volume is overwhelming, redesign intake before adding AI scoring. Limit the top of funnel, collect better signals earlier, and avoid creating a system where low-quality automation is compensating for a broken application process.
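
The batch-based idea can be sketched as a simple loop: review applications in waves and stop once enough strong candidates have been found. Everything here is hypothetical, including the `review_in_waves` name and the choice to shuffle first, which is one assumed way to avoid rewarding arrival order. `is_strong` stands in for whatever human or structured review the team actually uses.

```python
import random
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def review_in_waves(applications: Sequence[T], is_strong: Callable[[T], bool],
                    wave_size: int, target: int, seed: int = 0) -> tuple[list[T], int]:
    """Review shuffled applications in fixed-size waves until `target` strong ones are found.

    Returns (strong_candidates, number_reviewed).
    """
    if wave_size < 1:
        raise ValueError("wave_size must be at least 1")
    pool = list(applications)
    random.Random(seed).shuffle(pool)  # assumption: random order instead of arrival order
    strong: list[T] = []
    reviewed = 0
    for start in range(0, len(pool), wave_size):
        wave = pool[start:start + wave_size]
        reviewed += len(wave)
        strong.extend(a for a in wave if is_strong(a))
        if len(strong) >= target:
            break  # enough strong candidates; remaining waves are never opened
    return strong, reviewed


# Toy data: 100 applications, every tenth one counts as strong.
strong, reviewed = review_in_waves(range(100), lambda a: a % 10 == 0,
                                   wave_size=20, target=3)
print(len(strong), reviewed)
```

Whether stopping early is acceptable is a policy decision, and applicants in unopened waves should be told their application was not reviewed rather than rejected.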

Attribution:

jerrythegerbil #1 #2
RugnirViking #1
Xirdus #1
kasey_junk #1
conductr #1

Real resumes exposed concrete hallucinations

People who ran the tool on their own CVs did not just see score drift. They saw specific factual errors. The model missed known GitHub work, invented Google Summer of Code participation, ignored certifications and awards, and still penalized accomplished candidates for lacking the exact public signals it wanted. One commenter even ran Andrew Ng’s CV through it and got a failing-style score. That makes the problem less abstract than “LLMs are stochastic.” The tool is also plainly extracting facts unreliably.

Test these systems on gold-standard resumes with known attributes before using them operationally. If extraction is wrong, downstream ranking is noise with a spreadsheet attached.
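
A gold-standard test can be as simple as comparing the facts the tool extracted with the facts a human knows to be on the resume. The sketch below assumes facts have been normalized into comparable string labels; the labels and the `extraction_report` helper are hypothetical and only illustrate the bookkeeping.

```python
def extraction_report(expected: set[str], extracted: set[str]) -> dict:
    """Compare extracted facts with the known-true facts for one resume."""
    hits = expected & extracted
    return {
        "missed": sorted(expected - extracted),        # real facts the tool ignored
        "hallucinated": sorted(extracted - expected),  # facts the tool invented
        # By convention here, an empty set counts as a perfect score on that axis.
        "precision": len(hits) / len(extracted) if extracted else 1.0,
        "recall": len(hits) / len(expected) if expected else 1.0,
    }


# Hypothetical gold labels for one resume, and what a tool returned for it.
expected = {"github:known-repo", "cert:cloud-architect", "award:best-paper"}
extracted = {"github:known-repo", "gsoc:participant"}

report = extraction_report(expected, extracted)
print(report)
```

Run across a labeled set, the `hallucinated` column surfaces errors like invented program participation, and `missed` surfaces ignored certifications and awards, before any score is computed.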

Attribution:

joshmn #1
fernandopj #1 #2
robertlagrant #1
kdavis #1

Resume scoring is the wrong object

A few comments argued that the deeper mistake is trying to infer job fit from resumes at all. One company described replacing resumes with open-ended questionnaire answers tied to company values and role-specific prompts, then using AI only to help sort that richer signal. Others argued for easy work samples or comparative judgments between candidates rather than absolute scores. The useful shift here is away from “make resume parsing better” and toward collecting evidence that is actually predictive.

If you are redesigning hiring, start by changing the input, not the scorer. Structured writing samples, job-specific questions, and lightweight work tests are more defensible than numeric judgments over a resume blob.
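
Comparative judgments need some way to turn "A was preferred over B" into an ordering. A minimal sketch, assuming reviewers record each comparison as a (winner, loser) pair, is to rank by win rate. The `rank_by_win_rate` helper and candidate labels are hypothetical. Win rate is the simplest possible aggregate; better-founded options such as Bradley-Terry models or Elo ratings exist and handle uneven matchups more carefully.

```python
from collections import defaultdict


def rank_by_win_rate(comparisons: list[tuple[str, str]]) -> list[tuple[str, float]]:
    """Rank candidates by share of pairwise comparisons won.

    Each comparison is (winner, loser). Ties in win rate are broken alphabetically.
    """
    wins: dict[str, int] = defaultdict(int)
    appearances: dict[str, int] = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1
        appearances[winner] += 1
        appearances[loser] += 1
    rates = {c: wins[c] / appearances[c] for c in appearances}
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))


# Three reviewer judgments over anonymized work samples.
judgments = [("cand_a", "cand_b"), ("cand_a", "cand_c"), ("cand_b", "cand_c")]
print(rank_by_win_rate(judgments))
```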

Attribution:

a4isms #1
sp2hari #1

Against the grain

The tool was used for ranking, not hard rejection

The HackerRank CTO said this repo was a local demo setup using a small model, not the production configuration. He also said the system was meant to sort tens of thousands of intern resumes so humans could read the strongest first, with only very low scores ignored and most applications still reviewed manually. That does not rescue the design, but it does narrow the claim that this exact code was automatically rejecting candidates at scale.

When evaluating hiring automation, separate demo repos from production use and ask exactly where the model sits in the funnel. Ranking for reviewer order and autonomous rejection carry very different risk, even if both need scrutiny.

Attribution:

sp2hari #1 #2

Variance can be used as a signal

A smaller group argued that non-determinism is not always a flaw. If repeated runs produce a wide score distribution, that may reveal the model is uncertain about weak evidence. In that framing, a single score is the bug, while a distribution across many runs is the honest output. That still makes the current implementation bad, but it suggests a better use of LLMs as uncertainty estimators rather than crisp judges.

If you insist on LLM-based assessment, collect repeated samples and inspect variance instead of trusting one number. High spread should trigger manual review, not an averaged score that only looks confident.
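
One way to act on spread instead of averaging it away is a routing rule: keep the whole range, and send high-spread cases to a person. The `route_by_spread` name and the threshold of 5 points are arbitrary assumptions for illustration; a real threshold would need to be calibrated against reviewer judgments.

```python
import statistics


def route_by_spread(scores: list[float], max_stdev: float = 5.0) -> dict:
    """Report the score range from repeated runs and route unstable cases to a human."""
    if len(scores) < 2:
        raise ValueError("need at least two samples to estimate spread")
    spread = statistics.stdev(scores)
    return {
        "low": min(scores),
        "high": max(scores),
        "stdev": spread,
        # Deliberately no mean: the range is the output, not a single number.
        "route": "manual_review" if spread > max_stdev else "standard_queue",
    }


print(route_by_spread([90.0, 74.0, 88.0]))
print(route_by_spread([82.0, 83.0, 81.0]))
```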

Attribution:

PaulHoule #1
CuriouslyC #1
nonethewiser #1

Cold applications were already a black hole

Some commenters pushed back on treating LLMs as the reason job searches feel hopeless. They argued that cold applying has long been lossy, recruiter summaries were error-prone before current models, and referrals or direct outreach still dominate outcomes. That does not excuse bad automation. It places it in a pipeline that was already opaque and arbitrary.

Candidates should not overfit entirely to ATS discourse. Networking, referrals, and direct contact still matter more than polishing for a scorer you cannot see.

Attribution:

rsanek #1
us-merul #1
seanieb #1

In plain English

Article 22

A section of the GDPR that gives people rights related to decisions made solely by automated processing that significantly affect them.

disparate impact

A legal concept where a policy can be unlawful if it disproportionately harms a protected group, even without explicit intent to discriminate.

GDPR

General Data Protection Regulation, a European Union privacy law that includes rights around personal data handling.

GitHub

A widely used platform for hosting code repositories and collaborating on software projects.

LLM

Large language model, an AI system trained on large amounts of text that can generate and transform language and code.

OSS

Open source software, software whose source code is publicly available for others to inspect, use, and modify.

Reference links

Legal and policy references

Workday loses bid to toss AI discrimination suit in California
Cited as a current example of AI hiring tools facing discrimination litigation
Cornell Wex on preponderance of the evidence
Used to explain the lower evidentiary standard in civil suits compared with criminal cases
EU AI Act text
Referenced in discussion of high-risk AI systems for recruitment and selection in the European Union
Stanford HAI on AI hiring tools and racial bias
Shared as evidence that AI hiring tools can produce racial bias and systemic rejection
Congressional Research Service on disparate impact
Provided to support the point that US law can recognize discrimination through correlated outcomes

Technical references on LLM determinism

Defeating nondeterminism in LLM inference
Explains practical sources of nondeterminism in model inference beyond temperature settings
PyTorch deterministic algorithms docs
Referenced in debate over whether GPU operations can be made deterministic
Fireworks AI post on deterministic inference and batching
Shared as an example of the work needed to get truly deterministic output under batching

Hiring and resume screening examples

HackerRank hiring-agent repository
The open-source repo under discussion and a target of community PRs and issue reports
Known GSoC hallucination issue in hiring-agent
Linked to document a concrete bug where the tool invents Google Summer of Code participation
Amazon hiring algorithm bias example and NPR discrimination report card coverage
Used to illustrate how resume-screening systems can discriminate by proxy features
Quartz on liability for biased hiring algorithms
Referenced alongside Amazon’s abandoned hiring model as a cautionary precedent

Historical and analogy links

I Don’t Hire Unlucky People
Shared because the ATS behavior resembled the old joke about discarding resumes at random
Secretary problem
Used in discussion of sequential candidate selection and random filtering alternatives