Show HN: Autofit2 – End-to-end pipeline for multilingual text classification

22 points by leschak 2 days ago|2 comments

Hi HN, Stefan here. autofit2 is a project I used at my previous company and have now open-sourced. It has been used extensively for automated text moderation, but it can be applied to any text or document classification task. We had success modeling offensive text in 20+ languages (see github.com/neospe/dataload for all the datasets).

It's an integrated pipeline for lightweight multilingual text classification, covering preprocessing, training, and evaluation. It implements SetFit, a few-shot learning technique that works well for low-data regimes (down to a few dozen examples), and offers high throughput on CPUs, since it's based on Sentence Transformers. Dependencies are kept lean, but of course PyTorch itself isn't exactly small.
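For anyone new to SetFit: the core idea is to turn a handful of labeled examples into many sentence pairs, fine-tune the Sentence Transformer on those pairs with a contrastive objective, and then fit a small classifier on the resulting embeddings. Below is a minimal sketch of just the pair-generation step in plain Python. It is an illustration of the general technique, not autofit2's actual code; the function name `make_contrastive_pairs`, its parameters, and the toy data are all made up for this example.

```python
import itertools
import random


def make_contrastive_pairs(texts, labels, seed=0):
    """Build (text_a, text_b, target) triples from few-shot data.

    target is 1.0 when both texts share a label and 0.0 otherwise.
    The result is balanced: equally many positive and negative pairs.
    (Hypothetical helper for illustration only.)
    """
    examples = list(zip(texts, labels))
    positives, negatives = [], []
    # Enumerate every unordered pair of examples exactly once.
    for (text_a, label_a), (text_b, label_b) in itertools.combinations(examples, 2):
        if label_a == label_b:
            positives.append((text_a, text_b, 1.0))
        else:
            negatives.append((text_a, text_b, 0.0))
    # Shuffle deterministically, then truncate to the smaller group.
    rng = random.Random(seed)
    rng.shuffle(positives)
    rng.shuffle(negatives)
    k = min(len(positives), len(negatives))
    return positives[:k] + negatives[:k]


# Toy data, invented for this example.
texts = ["you are an idiot", "get lost, moron", "nice photo", "thanks for the help"]
labels = ["offensive", "offensive", "ok", "ok"]
pairs = make_contrastive_pairs(texts, labels)
```

In a full pipeline, pairs like these would be fed to a contrastive or cosine-similarity loss to fine-tune the embedding model, and the classifier would then be trained on embeddings of the original labeled texts. Because the number of pairs grows quadratically with the number of examples, even a few dozen examples give the encoder a reasonable amount of training signal.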

autofit2 takes a base model and a JSON config as input, and outputs a TorchServe model archive plus a model card. The model card includes any benchmarks you have for your task, self-consistency tests, the estimated CO2 emissions of the fine-tuning run, and an entropy-based bias analysis. For the bias evaluation, small test corpora for 50 languages are included. It works best with my EAR (Entropy-based Attention Regularization) fork of Sentence Transformers.

Feedback is welcome.

•

nmstoker 11 hours ago

How does this differ from SetFit? Is it just an alternative implementation?

I found the HF version pretty effective and it often works well for multilingual classification. I've used it for intent matching and was pleasantly surprised that Polish, German and other translations of our intents tended to work "for free" when training with just English training data!

https://github.com/huggingface/setfit

•

leschak 4 hours ago

Yes, it's an independent implementation, written four years ago when SetFit was still a new concept and HF's project didn't exist yet. I guess its value nowadays lies in its simplicity. It really is simple. It's also practical if you use TorchServe, because the embedding model and the classification model get serialized into one object.