Greenflash provides comprehensive model tracking and analytics to help you understand which AI models perform best for your use cases. By tracking model performance over time and across products, you can make data-driven decisions about model selection and optimization.

Why Track Model Performance?

Different AI models have different strengths:
  • GPT-4 might excel at complex reasoning but cost more
  • Claude 3.5 Sonnet might provide better customer satisfaction
  • Llama 3 might offer the best balance of cost and quality for your specific use case
Without tracking, you’re guessing. With Greenflash, you can:
  • See which models deliver the best user satisfaction
  • Compare quality metrics across models
  • Track performance trends over time
  • Make informed decisions about model switching

Sending Model Data

Conversation-Level Model

The simplest approach is to specify the model at the conversation level. All messages in the conversation will be associated with this model.
const params = {
  productId: 'YOUR_PRODUCT_ID',
  externalConversationId: 'YOUR_CONVERSATION_ID',
  externalUserId: 'YOUR_USER_ID',
  model: 'gpt-4-turbo',  // Track which model was used
  messages: [
    { role: 'user', content: 'Explain quantum computing' },
    { role: 'assistant', content: 'Quantum computing uses quantum mechanics...' }
  ]
};

client.messages.create(params);

Per-Message Model (Multi-Agent Scenarios)

For agentic workflows where different steps use different models, you can specify the model at the message level:
const params = {
  productId: 'YOUR_PRODUCT_ID',
  externalConversationId: 'YOUR_CONVERSATION_ID',
  externalUserId: 'YOUR_USER_ID',
  messages: [
    {
      role: 'user',
      content: 'Help me write and review a blog post'
    },
    {
      messageType: 'assistant_message',
      content: 'Here is a draft blog post about...',
      model: 'gpt-4-turbo'  // Writing step uses GPT-4
    },
    {
      messageType: 'thought',
      content: 'Reviewing the draft for clarity and accuracy...',
      model: 'claude-3-5-sonnet'  // Review step uses Claude
    },
    {
      messageType: 'final_response',
      content: 'Here is your revised blog post...',
      model: 'claude-3-5-sonnet'
    }
  ]
};

client.messages.create(params);

Combining Conversation and Message Models

You can set a default model at the conversation level and override it for specific messages:
combined-models.ts
const params = {
  productId: 'YOUR_PRODUCT_ID',
  externalConversationId: 'YOUR_CONVERSATION_ID',
  externalUserId: 'YOUR_USER_ID',
  model: 'gpt-3.5-turbo',  // Default model for most messages
  messages: [
    { role: 'user', content: 'Simple question' },
    { role: 'assistant', content: 'Simple answer' },  // Uses gpt-3.5-turbo
    { role: 'user', content: 'Complex question requiring deep reasoning' },
    {
      role: 'assistant',
      content: 'Detailed answer...',
      model: 'gpt-4-turbo'  // Override: use GPT-4 for complex reasoning
    }
  ]
};

client.messages.create(params);

Model Normalization

Greenflash automatically normalizes model strings to canonical identifiers. This means you can send various formats and they’ll be matched to the correct model:
You Send                      Canonical ID
gpt-4                         openai/gpt-4
gpt4                          openai/gpt-4
gpt-4-turbo-preview           openai/gpt-4-turbo
claude-3.5-sonnet             anthropic/claude-3-5-sonnet
claude-3-5-sonnet-20241022    anthropic/claude-3-5-sonnet
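One way to picture this normalization is an alias lookup with a date-suffix fallback. The sketch below is illustrative only: the alias table and stripping rules are assumptions, not Greenflash's actual implementation.

```typescript
// Hypothetical sketch of model-string normalization.
const MODEL_ALIASES: Record<string, string> = {
  'gpt-4': 'openai/gpt-4',
  'gpt4': 'openai/gpt-4',
  'gpt-4-turbo-preview': 'openai/gpt-4-turbo',
  'claude-3.5-sonnet': 'anthropic/claude-3-5-sonnet',
  'claude-3-5-sonnet-20241022': 'anthropic/claude-3-5-sonnet',
};

function normalizeModel(raw: string): string {
  const key = raw.trim().toLowerCase();
  // Exact alias match first...
  if (key in MODEL_ALIASES) return MODEL_ALIASES[key];
  // ...then try stripping a trailing date suffix (e.g. -20241022).
  const stripped = key.replace(/-\d{8}$/, '');
  if (stripped in MODEL_ALIASES) return MODEL_ALIASES[stripped];
  // Unknown strings pass through unchanged (treated as custom models).
  return raw;
}
```

Strings that match no known alias pass through untouched, which is how a custom model name like `my-company/fine-tuned-gpt4` would end up as its own entry.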

Custom Models

If you’re using a custom or fine-tuned model, Greenflash will create a custom model entry for it:
custom-model.ts
const params = {
  model: 'my-company/fine-tuned-gpt4',  // Custom model
  // ... rest of params
};
Custom models appear in your models list with a source: 'customer' tag, distinguishing them from standard models.

Viewing Model Performance

Models List

Navigate to the Models page in your dashboard to see all models used across your products:
  • Usage Count: How many conversations have used each model
  • Quality Index: Average conversation quality score
  • Satisfaction Score: User satisfaction with responses
  • Provider: OpenAI, Anthropic, Google, etc.
You can filter by:
  • Provider (OpenAI, Anthropic, etc.)
  • Product
  • Date range
And sort by:
  • Usage count
  • Quality index
  • Last used date

Model Detail Page

Click on any model to see detailed analytics:
  • Quality trends over 7, 30, or 90 days
  • Four pillar scores: Satisfaction, Growth, Friction, Safety
  • Sentiment analysis: Average user sentiment and trends
  • Products using this model: See which products rely on this model

Comparing Models

Greenflash makes it easy to compare models side-by-side. You can start a comparison from multiple places:

From the Model Detail Page

  1. Navigate to any model’s detail page
  2. Click the Compare button in the header
  3. Select additional models to compare (up to 5 total)

From the Models List

  1. Go to the Models page
  2. Click the menu on any model card
  3. Select Compare
  4. The comparison dialog opens with that model pre-selected

The Comparison Dialog

Use the searchable Add model dropdown to find models by name or provider. Models are grouped by whether they have analytics data available, so you can focus on models with meaningful metrics. The comparison view shows metrics side-by-side with visual indicators:
Metric              Description
Quality Index       Overall conversation quality (0-100)
Satisfaction        User satisfaction percentage
Safety              Safety score percentage
Hallucination Rate  Percentage of conversations with hallucinations
Jailbreak Rate      Percentage of jailbreak attempts
Conversations       Total conversation count
Input/Output Cost   Cost per 1K tokens
The best value for each metric is highlighted with a trophy icon, making it easy to identify which model excels in each area.
Model performance varies by use case. A model that performs well for customer support might not be optimal for code generation. Always compare models within the context of a specific product.

Model Recommendations

Greenflash automatically analyzes your model performance and generates actionable recommendations to help you optimize. This feature combines rules-based issue detection with AI-powered synthesis to surface the single most important optimization opportunity for each model.

How It Works

  1. Issue Detection - We continuously monitor metrics against thresholds to detect problems like high hallucination rates, user frustration, declining quality, and stale models with newer alternatives available
  2. Conversation Analysis - We sample representative conversations across the quality distribution to understand patterns and gather evidence
  3. AI Synthesis - An LLM synthesizes detected issues, conversation patterns, topic distribution, cost data, and user ratings into one personalized, actionable recommendation grounded in your product’s actual data
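Step 1 can be pictured as simple threshold rules over per-model metrics. The metric names and threshold values in this sketch are illustrative assumptions, not Greenflash's actual detection rules.

```typescript
// Illustrative threshold-based issue detection (step 1 above).
interface ModelMetrics {
  hallucinationRate: number; // fraction of conversations flagged
  frustrationRate: number;   // fraction of users showing frustration
  qualityTrend: number;      // change in quality index over the window
}

interface DetectedIssue {
  code: string;
  detail: string;
}

function detectIssues(m: ModelMetrics): DetectedIssue[] {
  const issues: DetectedIssue[] = [];
  if (m.hallucinationRate > 0.05) {
    issues.push({
      code: 'high_hallucination',
      detail: `${(m.hallucinationRate * 100).toFixed(1)}% of conversations`,
    });
  }
  if (m.frustrationRate > 0.1) {
    issues.push({
      code: 'user_frustration',
      detail: `${(m.frustrationRate * 100).toFixed(1)}% of users`,
    });
  }
  if (m.qualityTrend < -5) {
    issues.push({
      code: 'declining_quality',
      detail: `quality index down ${-m.qualityTrend} points`,
    });
  }
  return issues;
}
```

Detected issues like these then feed into steps 2 and 3, where conversation samples and an LLM turn them into a single prioritized recommendation.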

What You’ll See

When recommendations are available, the model detail page shows:
  • Summary - A personalized assessment of model health tailored to your product’s use case
  • Strengths - What’s working well for your specific product (up to 3)
  • Recommendation - One clear, prioritized action to take, including:
    • Action - Specific step to take
    • Rationale - Why this matters and what pattern was observed
    • Evidence - Example snippets from actual conversations
    • Expected Impact - Which metrics should improve (quality, satisfaction, safety, cost, frustration)
    • Category - Type of change: model switch, cost optimization, use case fit, or prompt optimization link

Stale Model Detection

Greenflash maintains a catalog of known model upgrade paths across providers. When a newer model from the same provider is available at equal or lower cost, we flag it as a detected issue and factor it into the recommendation. This makes it easy to identify low-risk upgrades where you get a newer model without paying more.

Each recommendation includes links to view the relevant conversations. Click through to see:
  • Conversations with detected issues (hallucinations, jailbreaks, etc.)
  • High and low quality conversation examples
  • Patterns that informed the recommendation
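The stale-model check amounts to a catalog lookup. In this sketch the upgrade entries and cost figures are hypothetical, not Greenflash's actual catalog.

```typescript
// Illustrative upgrade-path catalog. costDelta is the successor's cost
// relative to the current model (negative or zero = same price or cheaper).
interface CatalogEntry {
  successor: string;
  costDelta: number;
}

const UPGRADE_PATHS: Record<string, CatalogEntry> = {
  'openai/gpt-4': { successor: 'openai/gpt-4-turbo', costDelta: -0.02 },
};

function staleModelIssue(model: string): string | null {
  const entry = UPGRADE_PATHS[model];
  // Flag only low-risk upgrades: same provider, equal or lower cost.
  if (entry && entry.costDelta <= 0) {
    return `Newer model available: ${entry.successor}`;
  }
  return null;
}
```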
Recommendations require at least 10 analyzed conversations. If you don’t see recommendations yet, wait for more conversations to be processed.

Understanding Model Metrics

Quality Index

The Conversation Quality Index measures how well the model handled conversations on a 0-100 scale. It’s calculated from:
  • User satisfaction signals
  • Conversation outcomes
  • Safety metrics
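As a rough sketch, the index can be pictured as a weighted blend of those three inputs scaled to 0-100. The weights below are illustrative, not Greenflash's actual formula.

```typescript
// Illustrative only: a weighted blend of the three inputs listed above.
function qualityIndex(
  satisfaction: number, // 0-1, from user satisfaction signals
  outcome: number,      // 0-1, from conversation outcomes
  safety: number        // 0-1, from safety metrics
): number {
  const score = 0.4 * satisfaction + 0.4 * outcome + 0.2 * safety;
  return Math.round(score * 100);
}
```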

Pillar Scores

Each model is scored on four pillars:
Pillar          What it measures
Satisfaction    Did users get what they needed?
Growth          Did the interaction drive positive outcomes?
Friction        How smooth was the interaction?
Safety          Were there any safety issues?

Sentiment Analysis

Track how users feel during conversations:
  • Average sentiment: Positive, neutral, or negative
  • Sentiment change: Did users’ sentiment improve or decline?
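The two metrics above can be sketched from per-message sentiment scores. The [-1, 1] score range and the ±0.2 neutral band here are assumptions for illustration.

```typescript
// Sketch: summarize per-message sentiment scores in [-1, 1]
// (negative to positive) into the two metrics listed above.
function sentimentSummary(scores: number[]): { average: string; change: number } {
  const mean = scores.reduce((a, b) => a + b, 0) / scores.length;
  // Bucket the mean into positive / neutral / negative.
  const average = mean > 0.2 ? 'positive' : mean < -0.2 ? 'negative' : 'neutral';
  // Change: sentiment at the end of the conversation minus the start.
  const change = scores[scores.length - 1] - scores[0];
  return { average, change };
}
```

A positive `change` means sentiment improved over the conversation even if the average was middling.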

Best Practices

1. Track All Model Usage

Always send the model parameter, even for simple integrations:
always-track.ts
// Good: Always include model
const params = {
  model: 'gpt-4-turbo',
  messages: [...]
};

// Less ideal: No model tracking
const params = {
  messages: [...]  // Can't analyze model performance
};

2. Use Consistent Naming

Pick a naming convention and stick to it. Greenflash will normalize, but consistency helps:
// Pick one style
model: 'gpt-4-turbo'  // OpenAI style
model: 'anthropic/claude-3-5-sonnet'  // OpenRouter style

3. Segment by Product

Model performance varies by use case. Create separate products for different use cases to get meaningful comparisons:
// Customer support product
const supportParams = {
  productId: 'support-bot-prod',
  model: 'claude-3-5-sonnet',
  // ... rest of params
};

// Code assistant product
const codeParams = {
  productId: 'code-assistant-prod',
  model: 'gpt-4-turbo',
  // ... rest of params
};

4. A/B Test Models

Use sampling to compare models on the same traffic:
ab-test.ts
const model = Math.random() > 0.5 ? 'gpt-4-turbo' : 'claude-3-5-sonnet';

const params = {
  productId: 'YOUR_PRODUCT_ID',
  model: model,
  messages: [...]
};
Then compare performance in the Greenflash dashboard.