At a glance
The problem it solves
LLMs hallucinate ROI numbers
Standard LLM pipelines produce ROI figures that are plausible but unverifiable. This system computes savings deterministically from cluster data before passing any numbers to the model.
Business data is never clean
Exports from Zendesk, HubSpot, Notion, and Slack have completely different shapes. A Sentinel layer (DataRefiner) normalizes every source into atomic actions before analysis.
Insight without execution is useless
The system doesn't stop at a report — it compiles the optimized workflow directly into a valid Apache Airflow DAG (.py) using Jinja2 templates and AST validation.
System architecture
Text is vectorized on CPU by all-MiniLM-L6-v2 (384 dimensions), then clustered with HDBSCAN using the dynamic density heuristic max(2, 1.5 × ln(n)) for the minimum cluster size.
Only the mathematical centroids and ROI metrics are forwarded to the LLM — raw data never leaves the container.
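The dynamic density heuristic above can be sketched as a small function. Note the rounding mode is an assumption (the text only gives max(2, 1.5 × ln(n))), and the function name is illustrative:

```python
import math

def dynamic_min_cluster_size(n: int) -> int:
    """Dynamic density heuristic: max(2, 1.5 * ln(n)).

    How the fractional value is truncated is an assumption here;
    int() floors toward zero for positive values.
    """
    return max(2, int(1.5 * math.log(n)))

# Illustrative values:
# n = 10  -> max(2, int(3.45)) = 3
# n = 100 -> max(2, int(6.90)) = 6
```

The max(2, …) floor guarantees HDBSCAN always receives a valid minimum cluster size even for tiny inputs.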
Analyst → Diagnostic: a Pydantic object with friction points, manual task ratio, and validated financial projections.
Mapper → WorkflowOptimise: a deterministic post-processor classifies every original step into exactly one of automated / human / eliminated, with no ambiguity.
Advisor → RapportAdvisor: the final report, with an MVP stack vs. complete stack distinction.
Each type_noeud maps deterministically to an Airflow operator: PythonOperator, BashOperator, or BranchPythonOperator.
User flow
Select or upload a data source (Sentinel)
Choose from three preloaded datasets (Zendesk tickets, HubSpot recruitment pipeline, Notion/Slack marketing content) or upload a custom CSV/JSON/SOP text. The Sentinel validates and restructures unstructured sources into atomic actions automatically.
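The Sentinel's restructuring step can be sketched as a mapping from heterogeneous export rows onto one atomic-action shape. All field names below are assumptions for illustration, not the real source schemas:

```python
from typing import Any

def to_atomic_actions(source: str, records: list[dict[str, Any]]) -> list[dict[str, str]]:
    """Illustrative sketch of Sentinel normalization: each source's rows
    are projected onto a single atomic-action shape before analysis."""
    field_map = {                      # per-source field aliases (hypothetical)
        "zendesk": ("subject", "assignee"),
        "hubspot": ("deal_stage", "owner"),
        "notion":  ("task_name", "editor"),
    }
    text_key, actor_key = field_map[source]
    return [
        {"action": str(r.get(text_key, "")).strip(),
         "actor": str(r.get(actor_key, "unknown")),
         "source": source}
        for r in records
        if r.get(text_key)             # drop rows with no usable text
    ]
```

Downstream code then only ever sees the one normalized shape, regardless of which export produced it.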
Configure ROI parameters (Deterministic)
Set the average employer hourly rate ($/h) via sidebar slider. HDBSCAN parameters are also configurable: cluster selection epsilon and minimum cluster size (or dynamic heuristic). ROI is calculated mathematically from cluster data before any LLM is invoked.
Semantic clustering (HDBSCAN)
Text is vectorized via all-MiniLM-L6-v2 and clustered with HDBSCAN. Noise points (cluster -1) are discarded. The 3 most representative examples per cluster are extracted by cosine distance to centroid.
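The representative-example selection can be sketched in pure Python (the real code very likely operates on numpy arrays; this stand-in only shows the centroid-distance logic):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def representatives(embeddings: list[list[float]], k: int = 3) -> list[int]:
    """Indices of the k cluster members closest to the cluster centroid."""
    dim = len(embeddings[0])
    centroid = [sum(v[i] for v in embeddings) / len(embeddings) for i in range(dim)]
    ranked = sorted(range(len(embeddings)),
                    key=lambda i: cosine_distance(embeddings[i], centroid))
    return ranked[:k]
```

Only these few nearest-to-centroid examples per cluster reach the LLM, which keeps prompts small and representative.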
3-agent inference pipeline (LLM)
Analyst → Mapper → Advisor chain runs sequentially via instructor + Gemini Flash. Each agent receives a Pydantic-typed input and returns a Pydantic-validated output. Inference latency and token consumption are tracked per agent.
Export results (Codegen)
Download the full analysis as structured JSON, or export the optimized workflow as a deployment-ready Apache Airflow DAG (.py). The DAG code is validated via AST before download.
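The AST gate and the operator table can be sketched with the stdlib ast module. The dictionary keys below are hypothetical type_noeud values (only the three operator names come from the text), and the real check may inspect the parsed tree beyond bare syntax:

```python
import ast

# Hypothetical rendering of the type_noeud -> Airflow operator table
TYPE_NOEUD_TO_OPERATOR = {
    "python": "PythonOperator",
    "bash": "BashOperator",
    "branche": "BranchPythonOperator",
}

def validate_dag_source(code: str) -> bool:
    """Reject generated DAG code that is not syntactically valid Python.
    ast.parse compiles to a syntax tree without executing anything,
    so this is safe to run on untrusted generated code."""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False
```

Running the generated .py through ast.parse before offering the download guarantees the exported DAG at least imports cleanly as Python source.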
Key engineering decisions
Zero hallucination on financial data
ROI figures are computed by DataEngine.compute_roi() before any LLM call. The Analyst agent is instructed to inject these values verbatim — no recalculation permitted.
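A minimal stand-in for DataEngine.compute_roi() could look like the following. The exact formula, parameter names, and output fields are assumptions; only the "arithmetic before any LLM call" principle comes from the text:

```python
def compute_roi(total_minutes_per_week: float, hourly_rate: float,
                automatable_ratio: float, weeks_per_year: int = 52) -> dict[str, float]:
    """Illustrative ROI sketch: annual savings derived arithmetically
    from cluster volume, the configured hourly rate, and the share of
    work deemed automatable."""
    hours_saved_per_week = (total_minutes_per_week / 60.0) * automatable_ratio
    annual_savings = hours_saved_per_week * hourly_rate * weeks_per_year
    return {
        "hours_saved_per_week": round(hours_saved_per_week, 2),
        "annual_savings": round(annual_savings, 2),
    }
```

Because these numbers exist before the model is invoked, the Analyst prompt can simply quote them, and any figure in the report is traceable to this arithmetic.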
Strict Pydantic contracts everywhere
All inter-agent data transfers are typed via schemas.py. instructor forces structured output from Gemini in GEMINI_JSON mode. No manual JSON parsing exists anywhere in the codebase.
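The shape of that typed hand-off can be sketched with stdlib dataclasses standing in for the Pydantic models (field names are illustrative; the real code uses Pydantic schemas with instructor-patched Gemini calls):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Diagnostic:          # Analyst output (fields are illustrative)
    friction_points: list[str]
    manual_task_ratio: float

@dataclass
class Workflow:            # Mapper output (illustrative)
    steps: list[str]

@dataclass
class Report:              # Advisor output (illustrative)
    summary: str

def run_pipeline(clusters: list[str],
                 analyst: Callable[[list[str]], Diagnostic],
                 mapper: Callable[[Diagnostic], Workflow],
                 advisor: Callable[[Workflow], Report]) -> Report:
    """Sequential Analyst -> Mapper -> Advisor chain: each stage only
    ever sees the typed output of the previous one."""
    return advisor(mapper(analyst(clusters)))
```

Typing each boundary this way is what makes "no manual JSON parsing" possible: a stage either returns a valid object or fails validation immediately.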
Deterministic transformation summary
compute_transformation() in the Mapper agent classifies each original step into exactly one list (automated / human / eliminated) using set arithmetic — not LLM inference.
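The set-arithmetic classification can be sketched as follows. Argument names and the conflict rule (automation wins when a step appears in both lists) are assumptions; the exactly-one-bucket guarantee is what the text describes:

```python
def compute_transformation(original: set[str], automated: set[str],
                           kept_human: set[str]) -> dict[str, list[str]]:
    """Deterministic classifier sketch: every original step lands in
    exactly one bucket via set arithmetic, never LLM inference."""
    automated = original & automated             # only steps that really existed
    human = (original & kept_human) - automated  # automation wins on conflict (assumption)
    eliminated = original - automated - human    # whatever remains was removed
    return {
        "automated": sorted(automated),
        "human": sorted(human),
        "eliminated": sorted(eliminated),
    }
```

Since the three buckets are built by subtraction from the same original set, they partition it: no step can be dropped or double-counted.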
Production-ready containerization
Docker image runs under non-root UID 1000. ML model weights are pre-downloaded during build to eliminate cold starts. Graphviz system dependency is baked in.
Built-in observability
MetricsTracker records per-agent latency (seconds) and token consumption (prompt / completion / total) for the full pipeline. Displayed post-run in a telemetry expander.
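A minimal sketch of such a tracker is shown below. The class and method names mirror the text, but the structure is an assumption (the real implementation presumably pulls token counts from the API response objects):

```python
import time

class MetricsTracker:
    """Minimal per-agent telemetry sketch: wall-clock latency plus
    prompt / completion / total token counts, keyed by agent name."""
    def __init__(self) -> None:
        self.records: dict[str, dict[str, float]] = {}

    def track(self, agent: str, prompt_tokens: int, completion_tokens: int,
              started: float) -> None:
        self.records[agent] = {
            "latency_s": round(time.perf_counter() - started, 3),
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
        }
```

Keying records by agent name is what lets the telemetry expander break the run down stage by stage.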
Tested architecture
38 unit tests cover Pydantic invariants, HDBSCAN pipeline, ROI math, sanitization edge cases, SVG rendering, Airflow operator mapping, and agent mocking via unittest.mock.
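The agent-mocking style can be sketched with unittest.mock: the LLM client is replaced by a MagicMock so the test asserts plumbing, not model output. The run_analyst wrapper and its arguments are hypothetical:

```python
from unittest.mock import MagicMock

def run_analyst(client, roi: dict) -> dict:
    """Toy agent wrapper: forwards the precomputed ROI to the client
    and returns its structured response (illustrative, not the real code)."""
    return client.create(roi=roi)

def test_analyst_injects_roi_verbatim():
    client = MagicMock()
    client.create.return_value = {"annual_savings": 13000.0}
    out = run_analyst(client, {"annual_savings": 13000.0})
    # Verify the precomputed figure was passed through untouched.
    client.create.assert_called_once_with(roi={"annual_savings": 13000.0})
    assert out == {"annual_savings": 13000.0}
```

Mocking at the client boundary keeps the 38-test suite fast and deterministic: no network, no API key, no model variance.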
Technology stack
Project status
Complete and deployed. The full pipeline — data ingestion, semantic clustering, 3-agent inference, SVG visualization, and Airflow DAG export — is functional and containerized. The repository includes all source files, test suite, Dockerfile, and demo datasets. A Google AI Studio API key is required to run the LLM pipeline.