Copier AI Workflow Discovery Agent — Maxime Gagné

Python · Multi-Agent LLM · Process Mining · Docker

AI Workflow Discovery Agent
Diagnose processes. Generate automation.

A hybrid system that ingests raw business data — support tickets, CRM pipelines, SOPs — clusters it deterministically, then passes condensed metrics to a 3-agent LLM pipeline that diagnoses friction, designs an optimized workflow, and compiles a deployment-ready Apache Airflow DAG. ROI is calculated mathematically before any LLM call to eliminate hallucination.

Python 3.11 Gemini Flash instructor Pydantic v2 HDBSCAN sentence-transformers Graphviz Jinja2 Streamlit Docker Apache Airflow
3 specialized LLM agents
0 manual JSON parsing steps (Pydantic contracts)
5 input formats supported
AST-based Airflow DAG validation
🔢

LLMs hallucinate ROI numbers

Standard LLM pipelines produce ROI figures that are plausible but unverifiable. This system computes savings deterministically from cluster data before passing any numbers to the model.

🗂️

Business data is never clean

Exports from Zendesk, HubSpot, Notion, and Slack have completely different shapes. A Sentinel layer (DataRefiner) normalizes every source into atomic actions before analysis.

🏗️

Insight without execution is useless

The system doesn't stop at a report — it compiles the optimized workflow directly into a valid Apache Airflow DAG (.py) using Jinja2 templates and AST validation.

Layer 1 — Deterministic Engine
DataEngine + DataRefiner (Sentinel)
Raw exports (CSV/JSON/SOP text) are ingested and normalized into atomic actions. all-MiniLM-L6-v2 vectorizes the text on CPU (384 dimensions), then HDBSCAN clusters with a dynamic minimum-cluster-size heuristic, max(2, 1.5 × ln(n)). Only the mathematical centroids and ROI metrics are forwarded to the LLM — raw data never leaves the container.
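The minimum-cluster-size heuristic above can be sketched in a few lines (the function name is hypothetical; the HDBSCAN call is shown as a comment since it depends on cluster-count context not reproduced here):

```python
import math

def dynamic_min_cluster_size(n: int) -> int:
    """Density heuristic from the text: max(2, 1.5 * ln(n)).

    Returns an integer suitable for HDBSCAN's min_cluster_size.
    """
    return max(2, int(1.5 * math.log(n)))

# With the size computed, clustering would look roughly like:
# clusterer = hdbscan.HDBSCAN(min_cluster_size=dynamic_min_cluster_size(len(texts)))
# labels = clusterer.fit_predict(embeddings)  # 384-d MiniLM vectors
```

The `max(2, …)` floor guarantees a usable cluster size even on tiny datasets, where 1.5 × ln(n) would round down to 0 or 1.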
Layer 2A — Agent Analyst
Process Diagnosis
Receives clustered payload + pre-computed ROI metrics. Produces a structured DiagnosticAnalyste Pydantic object with friction points, manual task ratio, and validated financial projections.
Layer 2B — Agent Mapper
Workflow Architecture
Transforms the diagnosis into a directed acyclic graph (WorkflowOptimise). A deterministic post-processor classifies every original step into exactly one of: automated / human / eliminated — no ambiguity.
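The exactly-one-bucket guarantee can be illustrated with plain set arithmetic (the function and parameter names here are illustrative, not the project's actual API):

```python
def classify_steps(original: set[str],
                   automated: set[str],
                   human: set[str]) -> dict[str, set[str]]:
    """Partition original steps into exactly one of three disjoint buckets.

    `automated` and `human` come from the LLM-designed workflow; any original
    step the new workflow no longer references is eliminated. Pure set
    arithmetic — no LLM inference, so the partition is deterministic.
    """
    automated = automated & original           # keep only genuine original steps
    human = (human & original) - automated     # a step cannot be in two buckets
    eliminated = original - automated - human
    return {"automated": automated, "human": human, "eliminated": eliminated}
```

For example, `classify_steps({"triage", "reply", "log"}, {"triage"}, {"reply"})` places `"log"` in `eliminated` because no agent claimed it.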
Layer 2C — Agent Advisor
Technology Stack Selection (Single Source of Truth)
Receives the workflow graph and a curated catalog of 15 tools. Enforces a "Single Source of Truth" architecture: redundant databases are eliminated and a single orchestration tool is selected. Returns a RapportAdvisor that distinguishes an MVP stack from the complete stack.
Layer 3A — Visualization
SVG Workflow Diagram
Graphviz renders the Pydantic workflow model into an interactive SVG with color-coded node types (trigger, automatic, human, decision, end).
Layer 3B — Code Generation
Airflow DAG Compiler
Jinja2 template compiles workflow nodes into valid Python. Each type_noeud maps deterministically to an Airflow operator: PythonOperator, BashOperator, BranchPythonOperator.
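The deterministic mapping can be sketched with a plain dict (the real compiler drives a Jinja2 template; the `type_noeud` keys below are assumed, not confirmed by this page):

```python
# Hypothetical type_noeud values mapped to Airflow operator class names.
NODE_TO_OPERATOR = {
    "automatique": "PythonOperator",
    "commande": "BashOperator",
    "decision": "BranchPythonOperator",
}

def render_task(node_id: str, type_noeud: str) -> str:
    """Emit one Airflow task declaration line.

    Raises KeyError on an unknown node type, so an invalid workflow
    fails at compile time rather than at deploy time.
    """
    operator = NODE_TO_OPERATOR[type_noeud]
    return f'{node_id} = {operator}(task_id="{node_id}", ...)'
```

Because the mapping is a lookup table rather than a prompt, the same workflow graph always compiles to the same operators.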
01

Select or upload a data source Sentinel

Choose from three preloaded datasets (Zendesk tickets, HubSpot recruitment pipeline, Notion/Slack marketing content) or upload a custom CSV/JSON/SOP text. The Sentinel validates and restructures unstructured sources into atomic actions automatically.

02

Configure ROI parameters Deterministic

Set the average employer hourly rate ($/h) via sidebar slider. HDBSCAN parameters are also configurable: cluster selection epsilon and minimum cluster size (or dynamic heuristic). ROI is calculated mathematically from cluster data before any LLM is invoked.

03

Semantic clustering HDBSCAN

Text is vectorized via all-MiniLM-L6-v2 and clustered with HDBSCAN. Noise points (cluster -1) are discarded. The three most representative examples per cluster are selected by cosine distance to the cluster centroid.
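Selecting representatives by centroid proximity can be sketched in pure Python (a simplification of the vectorized implementation; function names are illustrative):

```python
import math

def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (na * nb)

def top_representatives(vectors, texts, centroid, k=3):
    """Return the k texts whose embeddings lie closest to the cluster centroid."""
    ranked = sorted(range(len(texts)),
                    key=lambda i: cosine_distance(vectors[i], centroid))
    return [texts[i] for i in ranked[:k]]
```

With unit-norm embeddings this reduces to ranking by dot product, which is how a vectorized version would typically do it.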

04

3-agent inference pipeline LLM

Analyst → Mapper → Advisor chain runs sequentially via instructor + Gemini Flash. Each agent receives a Pydantic-typed input and returns a Pydantic-validated output. Inference latency and token consumption are tracked per agent.
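The typed-chain pattern can be illustrated with stdlib dataclasses standing in for the Pydantic models (the real pipeline gets these types back from Gemini via instructor; every name below is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Diagnostic:          # stands in for a Pydantic model like DiagnosticAnalyste
    friction_points: list
    manual_ratio: float

@dataclass
class Workflow:            # stands in for a model like WorkflowOptimise
    nodes: list

def analyst(clusters: dict) -> Diagnostic:
    # In the real system this is an instructor-validated Gemini call.
    return Diagnostic(friction_points=list(clusters), manual_ratio=0.6)

def mapper(diag: Diagnostic) -> Workflow:
    # Each agent consumes the previous agent's typed output, never raw JSON.
    return Workflow(nodes=[f"fix:{p}" for p in diag.friction_points])

def run_pipeline(clusters: dict) -> Workflow:
    return mapper(analyst(clusters))   # Analyst → Mapper (Advisor omitted for brevity)
```

The point of the pattern is that a malformed agent response fails validation at the boundary instead of propagating a broken dict downstream.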

05

Export results Codegen

Download the full analysis as structured JSON, or export the optimized workflow as a deployment-ready Apache Airflow DAG (.py). The DAG code is validated via AST before download.
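AST validation of the generated DAG amounts to parsing the emitted source before offering it for download; a minimal version of such a check:

```python
import ast

def validate_dag_source(source: str) -> bool:
    """Return True if the generated DAG file is syntactically valid Python.

    ast.parse raises SyntaxError on invalid code. The generated file is
    never exec()'d, so this check is safe even on a broken template.
    """
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False
```

This catches template bugs (unbalanced quotes, bad indentation) at export time, before the file ever reaches an Airflow scheduler.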

🔒

Zero hallucination on financial data

ROI figures are computed by DataEngine.compute_roi() before any LLM call. The Analyst agent is instructed to inject these values verbatim — no recalculation permitted.
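A sketch of what such a deterministic ROI computation looks like (the actual formula inside `DataEngine.compute_roi()` is not shown on this page, so the inputs and weighting here are assumptions):

```python
def compute_roi(tasks_per_month: int, minutes_per_task: float,
                hourly_rate: float, automatable_share: float) -> float:
    """Monthly savings = automatable hours x hourly rate.

    Pure arithmetic on cluster-derived counts and the user-set rate;
    no LLM is involved, so the figure injected into the Analyst prompt
    cannot be hallucinated.
    """
    hours = tasks_per_month * minutes_per_task / 60.0
    return hours * automatable_share * hourly_rate
```

For example, 200 tasks/month at 15 minutes each is 50 hours; if 80% is automatable at $40/h, savings come to $1,600/month.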

📐

Strict Pydantic contracts everywhere

All inter-agent data transfers are typed via schemas.py. instructor forces structured output from Gemini in GEMINI_JSON mode. No manual JSON parsing exists anywhere in the codebase.

🌲

Deterministic transformation summary

compute_transformation() in the Mapper agent classifies each original step into exactly one list (automated / human / eliminated) using set arithmetic — not LLM inference.

📦

Production-ready containerization

Docker image runs under non-root UID 1000. ML model weights are pre-downloaded during build to eliminate cold starts. Graphviz system dependency is baked in.

📡

Built-in observability

MetricsTracker records per-agent latency (seconds) and token consumption (prompt / completion / total) for the full pipeline. Displayed post-run in a telemetry expander.
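A tracker of this shape needs nothing beyond the stdlib (the structure below is assumed; the real MetricsTracker API is not shown on this page):

```python
import time
from dataclasses import dataclass, field

@dataclass
class MetricsTracker:
    """Per-agent latency and token telemetry, keyed by agent name."""
    records: dict = field(default_factory=dict)

    def track(self, agent: str, fn, *args):
        """Time one agent call; fn must return (result, prompt_tokens, completion_tokens)."""
        start = time.perf_counter()
        result, prompt_tokens, completion_tokens = fn(*args)
        self.records[agent] = {
            "latency_s": time.perf_counter() - start,
            "prompt": prompt_tokens,
            "completion": completion_tokens,
            "total": prompt_tokens + completion_tokens,
        }
        return result
```

Wrapping each of the three agent calls this way yields exactly the per-agent latency and prompt/completion/total breakdown the telemetry expander displays.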

🧪

Tested architecture

38 unit tests cover Pydantic invariants, HDBSCAN pipeline, ROI math, sanitization edge cases, SVG rendering, Airflow operator mapping, and agent mocking via unittest.mock.

Gemini Flash Lite LLM backbone (3 agents)
instructor Structured LLM output
Pydantic v2 Data contracts & validation
HDBSCAN Density-based clustering
sentence-transformers Text vectorization (CPU)
Graphviz SVG workflow rendering
Jinja2 + AST Airflow DAG generation
Streamlit UI layer (decoupled)
Docker Containerized deployment
pytest + mock Unit test suite

Complete and deployed. The full pipeline — data ingestion, semantic clustering, 3-agent inference, SVG visualization, and Airflow DAG export — is functional and containerized. The repository includes all source files, test suite, Dockerfile, and demo datasets. A Google AI Studio API key is required to run the LLM pipeline.

View repository on GitHub ↗