Why Track Model Performance?
Different AI models have different strengths:
- GPT-4 might excel at complex reasoning but cost more
- Claude 3.5 Sonnet might provide better customer satisfaction
- Llama 3 might offer the best balance of cost and quality for your specific use case
By tracking model performance in Greenflash, you can:
- See which models deliver the best user satisfaction
- Compare quality metrics across models
- Track performance trends over time
- Make informed decisions about model switching
Sending Model Data
Conversation-Level Model
The simplest approach is to specify the model at the conversation level. All messages in the conversation will be associated with this model.

Per-Message Model (Multi-Agent Scenarios)
For agentic workflows where different steps use different models, you can specify the model at the message level.

Combining Conversation and Message Models
You can set a default model at the conversation level and override it for specific messages.
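Putting the two together, here is a sketch of what such a payload might look like. The field names (`conversationId`, `messages`, the optional per-message `model`) and the `effectiveModel` helper are illustrative assumptions, not the exact Greenflash SDK schema — check the API reference for the real shape:

```typescript
interface MessagePayload {
  role: "user" | "assistant";
  content: string;
  model?: string; // optional per-message override (e.g. a different agent step)
}

interface ConversationPayload {
  conversationId: string;
  model: string; // conversation-level default
  messages: MessagePayload[];
}

// Hypothetical helper: a message's model falls back to the conversation default.
function effectiveModel(conv: ConversationPayload, msg: MessagePayload): string {
  return msg.model ?? conv.model;
}

const conversation: ConversationPayload = {
  conversationId: "conv_123",
  model: "anthropic/claude-3-5-sonnet", // default for the whole conversation
  messages: [
    { role: "user", content: "Summarize this document." },
    // This step ran on a cheaper model, so it overrides the default:
    {
      role: "assistant",
      content: "Here is the summary.",
      model: "openai/gpt-4o-mini",
    },
  ],
};
```

Messages without a `model` inherit the conversation default; messages that set one are attributed to that model instead.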
Model Normalization
Greenflash automatically normalizes model strings to canonical identifiers. This means you can send various formats and they’ll be matched to the correct model:

| You Send | Canonical ID |
|---|---|
| gpt-4 | openai/gpt-4 |
| gpt4 | openai/gpt-4 |
| gpt-4-turbo-preview | openai/gpt-4-turbo |
| claude-3.5-sonnet | anthropic/claude-3-5-sonnet |
| claude-3-5-sonnet-20241022 | anthropic/claude-3-5-sonnet |
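As an illustration only, the mapping above can be pictured as a case-insensitive lookup. Greenflash's real matching logic is internal and covers far more variants; this table contains just the examples listed here:

```typescript
// Illustrative lookup built from the examples in the table above.
const CANONICAL_MODELS: Record<string, string> = {
  "gpt-4": "openai/gpt-4",
  "gpt4": "openai/gpt-4",
  "gpt-4-turbo-preview": "openai/gpt-4-turbo",
  "claude-3.5-sonnet": "anthropic/claude-3-5-sonnet",
  "claude-3-5-sonnet-20241022": "anthropic/claude-3-5-sonnet",
};

function normalizeModel(raw: string): string {
  // Case-insensitive lookup; unknown strings pass through unchanged
  // (Greenflash would create a custom model entry for those).
  return CANONICAL_MODELS[raw.trim().toLowerCase()] ?? raw;
}
```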
Custom Models
If you’re using a custom or fine-tuned model, Greenflash will create a custom model entry for it. Custom models appear with a source: 'customer' tag, distinguishing them from standard models.
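A sketch of what sending a fine-tuned model string might look like. The payload shape is an assumption, and the `ft:` model ID below is a made-up example of a fine-tuned model identifier:

```typescript
// Hypothetical message payload carrying a fine-tuned model ID that isn't in
// Greenflash's catalog; Greenflash would create a custom model entry
// (tagged source: 'customer') for it.
const customMessage = {
  role: "assistant" as const,
  content: "Your refund was processed yesterday.",
  model: "ft:gpt-4o-mini:acme-corp:support:abc123", // example fine-tuned model ID
};
```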
Viewing Model Performance
Models List
Navigate to the Models page in your dashboard to see all models used across your products:
- Usage Count: How many conversations have used each model
- Quality Index: Average conversation quality score
- Satisfaction Score: User satisfaction with responses
- Provider: OpenAI, Anthropic, Google, etc.

You can filter and sort the list by:
- Provider (OpenAI, Anthropic, etc.)
- Product
- Date range
- Usage count
- Quality index
- Last used date
Model Detail Page
Click on any model to see detailed analytics:
- Quality trends over 7, 30, or 90 days
- Four pillar scores: Satisfaction, Growth, Friction, Safety
- Sentiment analysis: Average user sentiment and trends
- Products using this model: See which products rely on this model
Comparing Models
Greenflash makes it easy to compare models side-by-side. You can start a comparison from multiple places:

From the Model Detail Page
- Navigate to any model’s detail page
- Click the Compare button in the header
- Select additional models to compare (up to 5 total)
From the Models List
- Go to the Models page
- Click the ⋮ menu on any model card
- Select Compare
- The comparison dialog opens with that model pre-selected
The Comparison Dialog
Use the searchable Add model dropdown to find models by name or provider. Models are grouped by whether they have analytics data available, so you can focus on models with meaningful metrics. The comparison view shows metrics side-by-side with visual indicators:

| Metric | Description |
|---|---|
| Quality Index | Overall conversation quality (0-100) |
| Satisfaction | User satisfaction percentage |
| Safety | Safety score percentage |
| Hallucination Rate | Percentage of conversations with hallucinations |
| Jailbreak Rate | Percentage of jailbreak attempts |
| Conversations | Total conversation count |
| Input/Output Cost | Cost per 1K tokens |
Model performance varies by use case. A model that performs well for customer support might not be optimal for code generation. Always compare models within the context of a specific product.
Model Recommendations
Greenflash automatically analyzes your model performance and generates actionable recommendations to help you optimize. This feature combines rules-based issue detection with AI-powered synthesis to surface the single most important optimization opportunity for each model.

How It Works
- Issue Detection - We continuously monitor metrics against thresholds to detect problems like high hallucination rates, user frustration, declining quality, and stale models with newer alternatives available
- Conversation Analysis - We sample representative conversations across the quality distribution to understand patterns and gather evidence
- AI Synthesis - An LLM synthesizes detected issues, conversation patterns, topic distribution, cost data, and user ratings into one personalized, actionable recommendation grounded in your product’s actual data
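Step 1 can be pictured as simple threshold checks over a model's metrics. The metric names and thresholds below are invented for illustration; the actual rules Greenflash applies are internal:

```typescript
// Purely illustrative sketch of rules-based issue detection.
interface ModelMetrics {
  hallucinationRate: number; // fraction of conversations, 0–1 (assumed)
  qualityIndexTrend: number; // change in Quality Index over the window (assumed)
  frustrationRate: number;   // fraction of conversations, 0–1 (assumed)
}

function detectIssues(m: ModelMetrics): string[] {
  const issues: string[] = [];
  // Thresholds here are made up for the sketch.
  if (m.hallucinationRate > 0.05) issues.push("high_hallucination_rate");
  if (m.qualityIndexTrend < -3) issues.push("declining_quality");
  if (m.frustrationRate > 0.1) issues.push("user_frustration");
  return issues;
}
```

Detected issues like these, plus sampled conversations and cost data, are what the AI synthesis step turns into one recommendation.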
What You’ll See
When recommendations are available, the model detail page shows:
- Summary - A personalized assessment of model health tailored to your product’s use case
- Strengths - What’s working well for your specific product (up to 3)
- Recommendation - One clear, prioritized action to take, including:
- Action - Specific step to take
- Rationale - Why this matters and what pattern was observed
- Evidence - Example snippets from actual conversations
- Expected Impact - Which metrics should improve (quality, satisfaction, safety, cost, frustration)
- Category - Type of change: model switch, cost optimization, use case fit, or prompt optimization
Stale Model Detection
Greenflash maintains a catalog of known model upgrade paths across providers. When a newer model from the same provider is available at equal or lower cost, we flag it as a detected issue and factor it into the recommendation. This makes it easy to identify low-risk upgrades where you get a newer model without paying more.

Deep Links to Conversations
Each recommendation includes links to view the relevant conversations. Click through to see:
- Conversations with detected issues (hallucinations, jailbreaks, etc.)
- High and low quality conversation examples
- Patterns that informed the recommendation
Understanding Model Metrics
Quality Index
The Conversation Quality Index measures how well the model handled conversations on a 0-100 scale. It’s calculated from:
- User satisfaction signals
- Conversation outcomes
- Safety metrics
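As a rough mental model only, the index can be thought of as a composite of those three inputs. The actual formula and weighting are internal to Greenflash; this sketch assumes equal weights purely for illustration:

```typescript
// Illustrative composite of the three inputs listed above,
// each assumed to be on a 0–100 scale. Equal weights are an assumption.
function qualityIndex(satisfaction: number, outcomes: number, safety: number): number {
  return (satisfaction + outcomes + safety) / 3;
}
```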
Pillar Scores
Each model is scored on four pillars:

| Pillar | What it measures |
|---|---|
| Satisfaction | Did users get what they needed? |
| Growth | Did the interaction drive positive outcomes? |
| Friction | How smooth was the interaction? |
| Safety | Were there any safety issues? |
Sentiment Analysis
Track how users feel during conversations:
- Average sentiment: Positive, neutral, or negative
- Sentiment change: Did users’ sentiment improve or decline?
Best Practices
1. Track All Model Usage
Always send the model parameter, even for simple integrations.
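For example, a minimal sketch. The message shape here is an assumption, not the verbatim Greenflash API; the point is simply that every tracked message carries a model string:

```typescript
type TrackedMessage = {
  conversationId: string;
  role: "user" | "assistant";
  content: string;
  model: string; // always set this, even in a single-model app
};

// Hypothetical helper that attaches the model that actually produced
// (or received) the message before sending it to Greenflash.
function buildTrackedMessage(content: string, role: "user" | "assistant"): TrackedMessage {
  return {
    conversationId: "conv_123",
    role,
    content,
    model: "openai/gpt-4",
  };
}
```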
2. Use Consistent Naming
Pick a naming convention and stick to it. Greenflash will normalize, but consistency helps.

3. Segment by Product
Model performance varies by use case. Create separate products for different use cases to get meaningful comparisons.

4. A/B Test Models
Use sampling to compare models on the same traffic.
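One common approach is deterministic bucketing by user ID, sketched below. The 50/50 split and the two model choices are assumptions for the example; hashing keeps each user pinned to one arm so their conversations stay comparable:

```typescript
// Deterministically assign each user to one of two models.
function pickModel(userId: string): string {
  // Tiny stable hash (djb2 variant); fine for bucketing, not cryptography.
  let h = 5381;
  for (const ch of userId) h = ((h * 33) ^ ch.charCodeAt(0)) >>> 0;
  return h % 2 === 0 ? "openai/gpt-4" : "anthropic/claude-3-5-sonnet";
}
```

Send the chosen model string with every tracked message so Greenflash can compare quality and satisfaction across the two arms.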

