# User Guide

## Environment Setup
- Install Python 3.10+ and git.
- Create a virtual environment and install dependencies:
- If your data lives elsewhere, update the `data.*` paths inside `config.yaml`.
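The exact keys under `data.*` depend on your `config.yaml`; a hypothetical layout (key names here are illustrative, not taken from the project) might look like:

```yaml
data:
  raw_dir: data/raw/             # incoming CSV files
  processed_dir: data/processed/ # cleaned train/val/test splits
  reference_dir: data/reference/ # industry description lookups
  mappings_dir: data/mappings/   # BCEA code mappings
```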
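A typical setup sketch for the steps above, assuming dependencies are listed in a `requirements.txt` at the repository root (adjust the filename if the project packages its dependencies differently):

```shell
# Create and activate a virtual environment (Python 3.10+)
python3 -m venv .venv
source .venv/bin/activate

# Install project dependencies (assumes a requirements.txt)
pip install -r requirements.txt
```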
## Data Requirements
- Training CSV must include `bus_name`, `description`, and `bcea_code` columns.
- Place raw files in `data/raw/`; preprocessing writes cleaned splits into `data/processed/` (`train.csv`, `val.csv`, `test.csv`).
- Reference lookups (industry descriptions, mappings) live in `data/reference/` and `data/mappings/`; keep both synced with the latest BCEA catalogue.
## Running the CLI
- Full workflow (preprocess → train → evaluate):
- Train on a manageable sample when the dataset is huge:
  The subset is always enforced when the flag is supplied, regardless of `training.subset_threshold`.
- Fine-tune with reviewer corrections (CSV files in `data/corrections/`):
- Force evaluation of the current promoted model without retraining:
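The guide does not name the CLI entry point or its subcommands, so the invocations below are a sketch using a hypothetical `python main.py` wrapper; only `--subset-size` is mentioned in this guide, and the other flags and subcommand names are assumptions to substitute with your project's actual command:

```shell
# Full workflow: preprocess, train, then evaluate (hypothetical subcommand)
python main.py run-all

# Train on a fixed-size sample; --subset-size is always enforced,
# regardless of training.subset_threshold
python main.py train --subset-size 10000

# Fine-tune on reviewer corrections stored in data/corrections/
python main.py finetune --corrections-dir data/corrections/

# Evaluate the currently promoted model without retraining
python main.py evaluate --no-train
```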
## Generating Predictions
- Score a new file without retraining:
- Output columns include `predicted_code`, `industry_description`, `confidence`, `confidence_grade`, `is_fallback`, and two `alt_*` alternatives. Results are timestamped and saved under `data/processed/`.
- Enable explanations (if supported by the model checkpoint) with `--explain`.
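A hedged sketch of scoring a new file; the entry point, subcommand, and `--input` flag are hypothetical (only `--explain` appears in this guide), so adapt them to your CLI:

```shell
# Score a new CSV without retraining (hypothetical entry point and input flag)
python main.py predict --input data/raw/new_businesses.csv

# Add per-prediction explanations if the checkpoint supports them
python main.py predict --input data/raw/new_businesses.csv --explain
```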
## FastAPI Service
- Start the server with `python api_server.py`; open `http://localhost:8000/docs` for the Swagger UI.
- Live endpoints mirror the CLI outputs: `/predict/business`, `/predict/batch`, `/predict/csv` (multipart upload).
- Responses contain the original input, predicted code, industry description, probability, graded confidence, and ranked alternatives.
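Once the server is running, the endpoints above can be exercised with `curl`. The JSON field names below (`bus_name`, `description`) are assumptions borrowed from the training-CSV schema; check the Swagger UI for the actual request models:

```shell
# Single-business prediction (field names assumed from the training schema)
curl -X POST http://localhost:8000/predict/business \
  -H "Content-Type: application/json" \
  -d '{"bus_name": "Acme Plumbing", "description": "Residential plumbing repairs"}'

# CSV scoring via multipart upload
curl -X POST http://localhost:8000/predict/csv \
  -F "file=@data/raw/new_businesses.csv"
```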
## Troubleshooting
- If `models/best_model/` is missing, run a training job before serving predictions.
- Memory warnings during training usually indicate the batch size is too high; lower `training.batch_size` or increase `gradient_accumulation_steps`.
- When the CLI appears to ignore your subset request, ensure you passed `--subset-size`; setting only `training.subset_size` in the config respects the `subset_threshold`.
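A sketch of the `training.*` keys referenced above, with illustrative values (the actual defaults live in your `config.yaml`):

```yaml
training:
  batch_size: 16                 # lower this if you see memory warnings
  gradient_accumulation_steps: 4 # raise this to keep the effective batch size
  subset_size: 50000             # applied only when the dataset exceeds
  subset_threshold: 100000       #   this row count; --subset-size bypasses the check
```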