Good results fine tuning a local LLM like Qwen 3:0.6B to categorize questions

AI
Machine Learning
Developer Tools

The post walks through fine-tuning Qwen 3 0.6B, a small local decoder-only language model, to map incoming questions, such as household queries, to one of a fixed set of categories before retrieval. The appeal is obvious for startup teams: it runs locally, trains on a modest dataset, and gives a concrete recipe rather than hand-wavy LLM talk. But the useful conclusion commenters reached is that the experiment says more about how forgiving classification is than about whether Qwen is the right tool. For a closed label set with roughly 800 examples, most of the strong technical feedback pointed to older, narrower approaches that are built for classification. BERT-style encoders such as ModernBERT, embeddings plus logistic regression, k-nearest neighbors over category embeddings, and even sparse text models all came up as likely better baselines on accuracy, speed, model size, and deployment simplicity.

If your problem is closed-set text classification, do not default to fine-tuning a decoder-only LLM just because it is local and small. Benchmark logistic regression, embedding-based classifiers, and BERT-style encoders first, then keep the LLM only if you need generative behavior or more flexible reasoning.

June 22, 2026
teachmecoolstuff.com
Discuss on HN

Discussion mood

Interested and upbeat about the hands-on write-up, but mostly skeptical of the model choice. The dominant reaction was that decoder-only small LLMs are a clumsy way to solve straightforward classification when encoder models, embeddings, or classic linear methods are faster, simpler, and often more accurate.

Key insights

Logistic regression beat the fine-tuned LLM

The strongest update is that a simple logistic regression follow-up did not just hold its own. It improved accuracy and cut both training and inference cost. That changes the interpretation of the original result. The interesting story is no longer that tiny LLM fine-tuning works. It is that the extra complexity was unnecessary for this task.

Treat the LLM result as a baseline to beat, not a deployment choice. If you already have embeddings or can produce them cheaply, train a linear classifier before spending time on LoRA or full fine-tuning.
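
A minimal sketch of what such a linear baseline looks like, using only the Python standard library. It is not the author's follow-up code: the toy questions, the labels, and the bag-of-words `featurize` function (a stand-in for real embeddings) are all assumptions made for illustration. In practice a library implementation such as scikit-learn's `LogisticRegression` would be the usual choice.

```python
import math
from collections import defaultdict

# Hypothetical toy data; not the author's dataset.
TRAIN = [
    ("How do I fix a leaking faucet?", "plumbing"),
    ("Why is my sink drain clogged?", "plumbing"),
    ("How do I reset a tripped breaker?", "electrical"),
    ("Why does my light switch spark?", "electrical"),
    ("When should I fertilize the lawn?", "garden"),
    ("How often should I water tomato plants?", "garden"),
]

def featurize(text):
    """Bag-of-words counts; an assumed stand-in for sentence embeddings."""
    counts = defaultdict(float)
    for word in text.lower().split():
        counts[word.strip("?.,!")] += 1.0
    return counts

def predict_proba(weights, bias, text):
    """Softmax over one linear score per label."""
    feats = featurize(text)
    scores = {
        label: bias[label]
        + sum(weights[label].get(w, 0.0) * v for w, v in feats.items())
        for label in bias
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - top) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def train(examples, epochs=300, lr=0.1):
    """Multinomial logistic regression fitted with plain SGD."""
    labels = sorted({label for _, label in examples})
    weights = {label: defaultdict(float) for label in labels}
    bias = {label: 0.0 for label in labels}
    for _ in range(epochs):
        for text, gold in examples:
            probs = predict_proba(weights, bias, text)
            feats = featurize(text)
            for label in labels:
                # Gradient of cross-entropy loss w.r.t. this label's score.
                grad = probs[label] - (1.0 if label == gold else 0.0)
                bias[label] -= lr * grad
                for w, v in feats.items():
                    weights[label][w] -= lr * grad * v
    return weights, bias

def predict(weights, bias, text):
    probs = predict_proba(weights, bias, text)
    return max(probs, key=probs.get)

weights, bias = train(TRAIN)
print(predict(weights, bias, "My faucet is leaking"))
```

The whole model is one weight per word per label, which is why training and inference are cheap compared with a 0.6B-parameter generator.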

Attribution:

dev-experiments #1 #2
nl #1

Encoder models fit this task better

For fixed-label classification, bidirectional encoder models like BERT and ModernBERT match the objective far better than decoder-only models like Qwen or Gemma. One commenter reported ModernBERT Large clearly beating Gemma 1B on a binary classification task and learning faster. Others pointed out that non-autoregressive transformers with a classification head are already absurdly effective here and package cleanly for deployment through ONNX.

If you need a trainable neural model rather than a linear baseline, start with an encoder and a classifier head. You are likely to get better accuracy with less compute and an easier serving path than a generative model gives you.

Attribution:

kamranjon #1
GardenLetter27 #1
doubtfuluser #1
stephantul #1
all2 #1

Invented labels are an output-control problem

The model making up categories like "apartments" is mostly a decoding issue, not a reason to abandon the approach. Runtimes such as llama.cpp can enforce a grammar or use logit masking so invalid labels are impossible to emit. That means one of the headline failure modes should have been removed at serving time instead of being treated as a model capability problem.

When your output space is small and known, enforce it in decoding. Do this before comparing models, otherwise you are penalizing one approach for a bug in the wrapper rather than a weakness in the classifier.
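
The idea can be sketched as greedy decoding with logit masking over a known label set. This is a toy illustration, not llama.cpp's API (llama.cpp expresses the same constraint through its grammar support): `toy_logits`, the pre-tokenized `LABELS`, and the function names are hypothetical, and the sketch assumes no label is a prefix of another.

```python
import math

# Hypothetical labels, already split into tokens.
LABELS = {("home", "_repair"), ("garden",), ("finance",)}

def toy_logits(prefix):
    """Stand-in for a model; it prefers the invented label 'apartments'."""
    return {"apartments": 5.0, "home": 2.0, "garden": 1.0,
            "finance": 0.5, "_repair": 0.1}

def allowed_next_tokens(label_token_seqs, prefix):
    """Tokens that extend `prefix` toward at least one valid label."""
    allowed = set()
    for seq in label_token_seqs:
        if len(seq) > len(prefix) and seq[:len(prefix)] == prefix:
            allowed.add(seq[len(prefix)])
    return allowed

def mask_logits(logits, allowed):
    """Set every disallowed token's score to -inf so it can never win."""
    return {tok: (score if tok in allowed else -math.inf)
            for tok, score in logits.items()}

def constrained_decode(next_token_logits, label_token_seqs):
    """Greedy decoding that can only ever spell out a valid label."""
    prefix = ()
    while prefix not in label_token_seqs:
        allowed = allowed_next_tokens(label_token_seqs, prefix)
        masked = mask_logits(next_token_logits(prefix), allowed)
        best = max(masked, key=masked.get)
        if masked[best] == -math.inf:
            raise ValueError("vocabulary is missing every allowed token")
        prefix = prefix + (best,)
    return prefix

print(constrained_decode(toy_logits, LABELS))
```

Without the mask, the highest-scoring token in this toy is the invalid "apartments"; with it, the decoder can only walk paths that end in a member of `LABELS`.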

Attribution:

nl #1
thomascountz #1
mijoharas #1

There is a rich middle ground

The useful alternatives are not just "regex or LLM." Several commenters sketched a full ladder of options between 2-grams and a 600-million-parameter generator. Sparse n-gram models, FastText-style embeddings, sentence embeddings plus cosine similarity or k-nearest neighbors, BERT variants, and small multilayer perceptron or support vector machine classifiers all fit this problem. Synthetic data generation, active learning, and hard-example creation can improve those pipelines without forcing you into end-to-end LLM fine-tuning.

Build classification systems as a benchmark stack, not a single-model bet. Compare at least one sparse baseline, one embedding baseline, and one encoder model so you know what the LLM is actually buying you.
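
One rung of that ladder, the embedding baseline with cosine similarity and k-nearest neighbors, can be sketched as follows. The `embed` function here is a hypothetical stand-in that counts character trigrams; a real pipeline would call a sentence-embedding model instead. The example questions and labels are also invented for illustration.

```python
import math
from collections import Counter

# Hypothetical labeled examples; not taken from the original post.
EXAMPLES = [
    ("How do I fix a leaking faucet?", "plumbing"),
    ("Why is my sink drain clogged?", "plumbing"),
    ("How do I reset a tripped breaker?", "electrical"),
    ("Why does my light switch spark?", "electrical"),
    ("When should I fertilize the lawn?", "garden"),
    ("How often should I water tomato plants?", "garden"),
]

def embed(text):
    """Assumed stand-in embedding: sparse character-trigram counts."""
    padded = f"  {text.lower()}  "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_predict(query, labeled, k=3):
    """Similarity-weighted vote among the k most similar stored examples."""
    q = embed(query)
    scored = sorted(
        ((cosine(q, embed(text)), label) for text, label in labeled),
        reverse=True,
    )
    votes = Counter()
    for similarity, label in scored[:k]:
        votes[label] += similarity
    return votes.most_common(1)[0][0]

print(knn_predict("My kitchen faucet keeps leaking", EXAMPLES))
```

There is no training step at all: adding a category means adding labeled examples, which makes this a cheap reference point for anything heavier.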

Attribution:

nl #1
zubiaur #1
deepsquirrelnet #1 #2
armcat #1
electroglyph #1

Category routing only helps if retrieval changes

Mapping a query to a category is only valuable when it drives a different retrieval strategy, such as querying separate indexes or applying different ranking logic. If the query can cross categories, a hard single-label classifier may lose recall instead of helping. The classification stage needs to be justified by downstream routing, not added because it feels structured.

Before adding a categorizer in front of retrieval, define what changes after the label is assigned. If nothing changes except metadata, you may be adding latency and failure modes without improving search.
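
A small sketch of what "something changes after the label is assigned" can mean: the label selects which index is searched, and low-confidence queries fall back to every index so cross-category questions do not lose recall. The `classify` callable, the index contents, the keyword matching, and the 0.7 threshold are all hypothetical choices for illustration.

```python
def route(query, classify, indexes, min_confidence=0.7):
    """Return the index names to search for this query."""
    label, confidence = classify(query)
    if label in indexes and confidence >= min_confidence:
        return [label]          # confident: search one category only
    return sorted(indexes)      # unsure: search everything to protect recall

def search(query, classify, indexes, min_confidence=0.7):
    """Naive keyword search over whichever indexes the router selected."""
    words = [w.strip("?.,!") for w in query.lower().split()]
    hits = []
    for name in route(query, classify, indexes, min_confidence):
        for doc in indexes[name]:
            if any(w and w in doc.lower() for w in words):
                hits.append((name, doc))
    return hits

# Hypothetical per-category indexes.
INDEXES = {
    "plumbing": ["Replacing a faucet washer", "Clearing a clogged drain"],
    "electrical": ["Resetting a breaker", "Wiring near a faucet or sink"],
}

def confident(query):
    return ("plumbing", 0.9)    # stand-in classifier output

def unsure(query):
    return ("plumbing", 0.4)    # same label, low confidence

print(search("faucet", confident, INDEXES))
print(search("faucet", unsure, INDEXES))
```

If both calls would return the same documents in your system, the categorizer is only adding latency.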

Attribution:

mettamage #1
pj_mukh #1

Against the grain

Good enough may be good enough

A fair pushback against the "use BERT instead" chorus is that there are many models between 2-grams and Qwen 3 0.6B, and if this model solves the actual business problem then it is not automatically a bad choice. The critique is mostly about optimality, not viability. That matters if the team already has local LLM infrastructure and values one reusable model family over a task-specific stack.

Do not over-optimize architecture purity if the system already meets your latency, cost, and accuracy targets. Standardize on a suboptimal model only when the operational simplicity is real and measurable.

Attribution:

brokensegue #1

LLMs can bootstrap the training data

One commenter argued that the more interesting role for LLMs here is not as the final classifier but as the data engine behind it. They can generate labeled examples, simulate active learning, or implement weak supervision in the spirit of Snorkel. That makes the model choice less binary. Even if a linear or encoder classifier wins at inference time, LLMs may still be the fastest way to build the dataset.

If your bottleneck is annotation rather than inference, use an LLM upstream to create or expand training data. Then train the smallest classifier that holds up on real examples.
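
A rough sketch of weak supervision in that spirit, assuming keyword rules as labeling functions and a simple majority vote to combine them. This is not Snorkel's API, and Snorkel itself fits a label model rather than taking a plain majority. The function names and keywords are hypothetical; in the setup the commenter describes, a labeling function could just as well wrap an LLM prompt.

```python
from collections import Counter

ABSTAIN = None

def keyword_lf(label, keywords):
    """Build a labeling function that votes `label` when any keyword appears."""
    def labeling_function(text):
        lowered = text.lower()
        return label if any(k in lowered for k in keywords) else ABSTAIN
    return labeling_function

# Hypothetical noisy rules; each may abstain or be wrong.
LABELING_FUNCTIONS = [
    keyword_lf("plumbing", ("faucet", "drain", "pipe")),
    keyword_lf("electrical", ("breaker", "outlet", "wiring")),
    keyword_lf("garden", ("lawn", "tomato", "fertilize")),
]

def weak_label(text, labeling_functions):
    """Majority vote over non-abstaining functions; ties and silence abstain."""
    votes = Counter(
        vote
        for vote in (lf(text) for lf in labeling_functions)
        if vote is not ABSTAIN
    )
    if not votes:
        return ABSTAIN
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # conflicting evidence: leave for a human or an LLM
    return ranked[0][0]

unlabeled = ["My faucet drips all night", "The pipe near the outlet is wet"]
training_set = [
    (text, label)
    for text in unlabeled
    if (label := weak_label(text, LABELING_FUNCTIONS)) is not ABSTAIN
]
print(training_set)
```

The resulting pairs feed whatever small classifier you choose; the examples that abstain are the ones worth sending to a human or a stronger model.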

Attribution:

IanCal #1

In plain English

BERT

Bidirectional Encoder Representations from Transformers, a transformer model architecture designed to read text in both directions and commonly used for classification.

cosine similarity

A way to measure how similar two vectors are by comparing their direction rather than their size.

decoder-only

A model architecture that generates text one token at a time using only the text that came before.

embedding

A numeric vector representation of a piece of text, such as a token, a sentence, or a whole question, that a model or classifier can process.

FastText

A lightweight text representation and classification approach from Meta that is known for speed and small model size.

k-nearest neighbors

A method that predicts a label by finding the most similar examples in a stored set.

llama.cpp

An open source project for running language models efficiently on local hardware.

LLM

Large Language Model, an AI system trained to generate and analyze text and code.

logistic regression

A simple statistical classification method often used as a strong baseline for text categorization.

logit masking

A decoding technique that blocks invalid output tokens by setting their scores so low they cannot be chosen.

ModernBERT

A newer BERT-style encoder model optimized for strong text understanding and classification performance.

n-gram

A sequence of n words or characters used as a feature in traditional language models and classifiers.

ONNX

Open Neural Network Exchange, a standard format for moving machine learning models between tools and deployment environments.

Qwen 3 0.6B

A small version of the Qwen language model with about 0.6 billion parameters.

Snorkel

A system for weak supervision that uses noisy labeling rules to create training data for machine learning models.

Reference links

Follow-up experiments from the author

Using logistic regression to categorize questions
The author's follow-up experiment reported better accuracy and performance than the fine-tuned LLM.
Logistic regression section in the GitHub repo README
A commenter pointed out that the repository already mentioned a logistic regression baseline that was not discussed in the original post.

Alternative modeling approaches

Ways to use NLI x encoder models
Suggested as a path for going deeper on classification with natural language inference and encoder models.
UK Government Incubator for AI consult classifier pipeline
Shared as a real-world example of an LLM-driven survey answer classification pipeline.