Developer Guide

Architecture Snapshot

  • main.py orchestrates the preprocessing, training, fine-tuning, prediction, and evaluation stages of the ISCO pipeline, with each stage driven by CLI flags.
  • src/preprocess.py cleans text, joins business names and descriptions, and prepares train/val/test splits with consistent label IDs.
  • src/model.py wraps Hugging Face transformers for training; it reads training.* options (batch size, optimizations, subset sampling) from the normalized config.
  • src/predict.py handles batch inference, confidence grading, and alternative prediction ranking for both CLI and API calls.
  • api/ exposes the FastAPI surface (routers/predict.py) and shares the same artifacts (models/best_model/) as the CLI.
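The stage dispatch in main.py can be pictured as a small flag-driven driver. The sketch below is illustrative only: the flag names and the run helper are assumptions, not the real CLI surface of main.py.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical flags; the real main.py defines its own CLI surface.
    parser = argparse.ArgumentParser(description="ISCO pipeline driver (sketch)")
    parser.add_argument("--preprocess", action="store_true")
    parser.add_argument("--train", action="store_true")
    parser.add_argument("--evaluate", action="store_true")
    return parser


def run(argv=None) -> list[str]:
    # Collect the requested stages in pipeline order and return them,
    # standing in for calls into src/preprocess.py, src/model.py, etc.
    args = build_parser().parse_args(argv)
    stages = []
    if args.preprocess:
        stages.append("preprocess")
    if args.train:
        stages.append("train")
    if args.evaluate:
        stages.append("evaluate")
    return stages
```

For example, `run(["--preprocess", "--evaluate"])` would select the preprocessing and evaluation stages while skipping training.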

Repository Layout

├─ src/              # Core pipeline modules
├─ api/              # FastAPI app, routers, pydantic models
├─ data/             # raw/ processed/ mappings/ review/ reference/
├─ models/           # Timestamped runs and best_model/
├─ logs/             # Rich console and trainer logs
├─ docs/             # MkDocs content (this site)
├─ config.yaml       # Central configuration
├─ requirements.txt  # Runtime dependencies
└─ requirements-docs.txt  # MkDocs extras

Coding Guidelines

  • Follow PEP 8 with four-space indentation and descriptive snake_case identifiers; keep classes in PascalCase.
  • Mirror existing docstrings (triple double-quotes + Args/Returns) and use logging helpers from src/utils.py for consistency.
  • Store tunables in config.yaml; CLI flags override them through _normalize_config in src/utils.py.
  • Prefer pure functions and explicit dependency injection—pass configs and paths instead of relying on globals.
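The override behavior of _normalize_config can be approximated by a pure merge function. This is a minimal sketch under the assumption that unset CLI flags arrive as None; the real helper in src/utils.py may differ.

```python
def normalize_config(file_config: dict, cli_overrides: dict) -> dict:
    """Merge config.yaml values with CLI overrides (sketch).

    Args:
        file_config: Values loaded from config.yaml.
        cli_overrides: Flag values from the CLI; None means "not passed".

    Returns:
        A new dict where explicitly passed CLI flags win over file values.
    """
    merged = dict(file_config)  # never mutate the caller's config
    for key, value in cli_overrides.items():
        if value is not None:
            merged[key] = value
    return merged
```

Keeping the merge pure (inputs in, new dict out) makes it trivial to unit-test without touching the filesystem or global state.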

Working With Data & Models

  • Preprocessing writes train.csv, val.csv, and test.csv under data/processed/; each row carries combined_text, bcea_code, and label_id.
  • The best model promotion logic compares metrics against models/best_model/metrics.json unless --force-update-best is set.
  • Subset sampling is controlled via training.subset_size/subset_threshold; the CLI flag --subset-size sets subset_force so the trainer always samples the requested size.

Testing & Validation

  • Run lightweight smoke checks with python -m compileall main.py src api (mirrors the existing CI sanity check).
  • Add formal tests under a future tests/ directory using pytest; stub out small CSV fixtures to reach preprocessing and prediction branches.
  • Before promoting a build, capture metrics with python main.py --skip-training --evaluate and archive the generated HTML report in logs/.
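A future pytest module could follow the fixture-stub pattern described above. Everything here is hypothetical: preprocess_rows stands in for whatever function src/preprocess.py actually exposes, and the two-row CSV is an invented fixture.

```python
import csv
import io


def preprocess_rows(raw_csv: str) -> list[dict]:
    # Stand-in for the real preprocessing entry point: joins the business
    # name and description into combined_text, as the pipeline does.
    rows = list(csv.DictReader(io.StringIO(raw_csv)))
    for row in rows:
        row["combined_text"] = f"{row['name']} {row['description']}".strip()
    return rows


def test_combined_text_is_joined():
    # Tiny inline CSV fixture; a real test would load it from tests/fixtures/.
    fixture = "name,description\nAcme,Sells widgets\n"
    rows = preprocess_rows(fixture)
    assert rows[0]["combined_text"] == "Acme Sells widgets"
```

Small in-memory fixtures like this keep the tests fast and free of dependencies on data/raw/.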

Extending the API

  • Pydantic models live in api/models.py; augment them when adding fields so both CLI and API responses stay in sync.
  • Shared helpers (e.g., confidence grading) should remain in src/ and be imported by the FastAPI layer—avoid duplicating business logic inside api/.
  • Expose new routes by wiring routers in api/main.py and documenting them in docs/api.md.