Why Track Model Performance?
Different AI models have different strengths:
- GPT-4 might excel at complex reasoning but cost more
- Claude 3.5 Sonnet might provide better customer satisfaction
- Llama 3 might offer the best balance of cost and quality for your specific use case
By tracking model performance in Greenflash, you can:
- See which models deliver the best user satisfaction
- Compare quality metrics across models
- Track performance trends over time
- Make informed decisions about model switching
Sending Model Data
Conversation-Level Model
The simplest approach is to specify the model at the conversation level. All messages in the conversation will be associated with this model.

Per-Message Model (Multi-Agent Scenarios)
For agentic workflows where different steps use different models, you can specify the model at the message level.

Combining Conversation and Message Models
You can set a default model at the conversation level and override it for specific messages.
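Putting the two together, here is a sketch of what such a payload might look like. The field names (`conversationId`, `messages`, the optional per-message `model`) and the `effectiveModel` helper are illustrative assumptions, not the exact Greenflash SDK schema — check the API reference for the real shape:

```typescript
interface MessagePayload {
  role: "user" | "assistant";
  content: string;
  model?: string; // optional per-message override (e.g. a different agent step)
}

interface ConversationPayload {
  conversationId: string;
  model: string; // conversation-level default
  messages: MessagePayload[];
}

// Hypothetical helper: a message's model falls back to the conversation default.
function effectiveModel(conv: ConversationPayload, msg: MessagePayload): string {
  return msg.model ?? conv.model;
}

const conversation: ConversationPayload = {
  conversationId: "conv_123",
  model: "anthropic/claude-3-5-sonnet", // default for the whole conversation
  messages: [
    { role: "user", content: "Summarize this document." },
    // This step ran on a cheaper model, so it overrides the default:
    {
      role: "assistant",
      content: "Here is the summary.",
      model: "openai/gpt-4o-mini",
    },
  ],
};
```

Messages without a `model` inherit the conversation default; messages that set one are attributed to that model instead.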
Model Normalization
Greenflash automatically normalizes model strings to canonical identifiers. This means you can send various formats and they’ll be matched to the correct model:

| You Send | Canonical ID |
|---|---|
| gpt-4 | openai/gpt-4 |
| gpt4 | openai/gpt-4 |
| gpt-4-turbo-preview | openai/gpt-4-turbo |
| claude-3.5-sonnet | anthropic/claude-3-5-sonnet |
| claude-3-5-sonnet-20241022 | anthropic/claude-3-5-sonnet |
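As an illustration only, the mapping above can be pictured as a case-insensitive lookup. Greenflash's real matching logic is internal and covers far more variants; this table contains just the examples listed here:

```typescript
// Illustrative lookup built from the examples in the table above.
const CANONICAL_MODELS: Record<string, string> = {
  "gpt-4": "openai/gpt-4",
  "gpt4": "openai/gpt-4",
  "gpt-4-turbo-preview": "openai/gpt-4-turbo",
  "claude-3.5-sonnet": "anthropic/claude-3-5-sonnet",
  "claude-3-5-sonnet-20241022": "anthropic/claude-3-5-sonnet",
};

function normalizeModel(raw: string): string {
  // Case-insensitive lookup; unknown strings pass through unchanged
  // (Greenflash would create a custom model entry for those).
  return CANONICAL_MODELS[raw.trim().toLowerCase()] ?? raw;
}
```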
Custom Models
If you’re using a custom or fine-tuned model, Greenflash will create a custom model entry for it. Custom models appear with a source: 'customer' tag, distinguishing them from standard models.
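A sketch of what sending a fine-tuned model string might look like. The payload shape is an assumption, and the `ft:` model ID below is a made-up example of a fine-tuned model identifier:

```typescript
// Hypothetical message payload carrying a fine-tuned model ID that isn't in
// Greenflash's catalog; Greenflash would create a custom model entry
// (tagged source: 'customer') for it.
const customMessage = {
  role: "assistant" as const,
  content: "Your refund was processed yesterday.",
  model: "ft:gpt-4o-mini:acme-corp:support:abc123", // example fine-tuned model ID
};
```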
Viewing Model Performance
Models List
Navigate to the Models page in your dashboard to see all models used across your products:
- Usage Count: How many conversations have used each model
- Quality Index: Average conversation quality score
- Satisfaction Score: User satisfaction with responses
- Provider: OpenAI, Anthropic, Google, etc.

You can filter and sort the list by:
- Provider (OpenAI, Anthropic, etc.)
- Product
- Date range
- Usage count
- Quality index
- Last used date
Model Detail Page
Click on any model to see detailed analytics:
- Quality trends over 7, 30, or 90 days
- Four pillar scores: Satisfaction, Growth, Friction, Safety
- Sentiment analysis: Average user sentiment and trends
- Products using this model: See which products rely on this model
Comparing Models
Greenflash makes it easy to compare models side-by-side. You can start a comparison from multiple places:

From the Model Detail Page
- Navigate to any model’s detail page
- Click the Compare button in the header
- Select additional models to compare (up to 5 total)
From the Models List
- Go to the Models page
- Click the ⋮ menu on any model card
- Select Compare
- The comparison dialog opens with that model pre-selected
The Comparison Dialog
Use the searchable Add model dropdown to find models by name or provider. Models are grouped by whether they have analytics data available, so you can focus on models with meaningful metrics. The comparison view shows metrics side-by-side with visual indicators:

| Metric | Description |
|---|---|
| Quality Index | Overall conversation quality (0-100) |
| Satisfaction | User satisfaction percentage |
| Safety | Safety score percentage |
| Hallucination Rate | Percentage of conversations with hallucinations |
| Jailbreak Rate | Percentage of jailbreak attempts |
| Conversations | Total conversation count |
| Input/Output Cost | Cost per 1K tokens |
Model performance varies by use case. A model that performs well for customer support might not be optimal for code generation. Always compare models within the context of a specific product.
Model Recommendations
Greenflash automatically analyzes your model performance and generates actionable recommendations to help you optimize. This feature combines rules-based issue detection with AI-powered synthesis to surface the single most important optimization opportunity for each model.

How It Works
- Issue Detection - We continuously monitor metrics against thresholds to detect problems like high hallucination rates, user frustration, declining quality, and stale models with newer alternatives available
- Conversation Analysis - We sample representative conversations across the quality distribution to understand patterns and gather evidence
- AI Synthesis - An LLM synthesizes detected issues, conversation patterns, topic distribution, cost data, and user ratings into one personalized, actionable recommendation grounded in your product’s actual data
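Step 1 can be pictured as simple threshold checks over a model's metrics. The metric names and thresholds below are invented for illustration; the actual rules Greenflash applies are internal:

```typescript
// Purely illustrative sketch of rules-based issue detection.
interface ModelMetrics {
  hallucinationRate: number; // fraction of conversations, 0–1 (assumed)
  qualityIndexTrend: number; // change in Quality Index over the window (assumed)
  frustrationRate: number;   // fraction of conversations, 0–1 (assumed)
}

function detectIssues(m: ModelMetrics): string[] {
  const issues: string[] = [];
  // Thresholds here are made up for the sketch.
  if (m.hallucinationRate > 0.05) issues.push("high_hallucination_rate");
  if (m.qualityIndexTrend < -3) issues.push("declining_quality");
  if (m.frustrationRate > 0.1) issues.push("user_frustration");
  return issues;
}
```

Detected issues like these, plus sampled conversations and cost data, are what the AI synthesis step turns into one recommendation.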
What You’ll See
When recommendations are available, the model detail page shows:
- Summary - A personalized assessment of model health tailored to your product’s use case
- Strengths - What’s working well for your specific product (up to 3)
- Recommendation - One clear, prioritized action to take, including:
- Action - Specific step to take
- Rationale - Why this matters and what pattern was observed
- Evidence - Example snippets from actual conversations
- Expected Impact - Which metrics should improve (quality, satisfaction, safety, cost, frustration)
- Category - Type of change: model switch, cost optimization, use case fit, or prompt optimization
Stale Model Detection
Greenflash maintains a catalog of known model upgrade paths across providers. When a newer model from the same provider is available at equal or lower cost, we flag it as a detected issue and factor it into the recommendation. This makes it easy to identify low-risk upgrades where you get a newer model without paying more.

Deep Links to Conversations
Each recommendation includes links to view the relevant conversations. Click through to see:
- Conversations with detected issues (hallucinations, jailbreaks, etc.)
- High and low quality conversation examples
- Patterns that informed the recommendation
Understanding Model Metrics
Quality Index
The Conversation Quality Index measures how well the model handled conversations on a 0-100 scale. It’s calculated from:
- User satisfaction signals
- Conversation outcomes
- Safety metrics
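As a rough mental model only, the index can be thought of as a composite of those three inputs. The actual formula and weighting are internal to Greenflash; this sketch assumes equal weights purely for illustration:

```typescript
// Illustrative composite of the three inputs listed above,
// each assumed to be on a 0–100 scale. Equal weights are an assumption.
function qualityIndex(satisfaction: number, outcomes: number, safety: number): number {
  return (satisfaction + outcomes + safety) / 3;
}
```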
Pillar Scores
Each model is scored on four pillars:

| Pillar | What it measures |
|---|---|
| Satisfaction | Did users get what they needed? |
| Growth | Did the interaction drive positive outcomes? |
| Friction | How smooth was the interaction? |
| Safety | Were there any safety issues? |
Sentiment Analysis
Track how users feel during conversations:
- Average sentiment: Positive, neutral, or negative
- Sentiment change: Did users’ sentiment improve or decline?
Best Practices
1. Track All Model Usage
Always send the model parameter, even for simple integrations.
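For example, a minimal sketch. The message shape here is an assumption, not the verbatim Greenflash API; the point is simply that every tracked message carries a model string:

```typescript
type TrackedMessage = {
  conversationId: string;
  role: "user" | "assistant";
  content: string;
  model: string; // always set this, even in a single-model app
};

// Hypothetical helper that attaches the model that actually produced
// (or received) the message before sending it to Greenflash.
function buildTrackedMessage(content: string, role: "user" | "assistant"): TrackedMessage {
  return {
    conversationId: "conv_123",
    role,
    content,
    model: "openai/gpt-4",
  };
}
```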
2. Use Consistent Naming
Pick a naming convention and stick to it. Greenflash will normalize, but consistency helps.

3. Segment by Product
Model performance varies by use case. Create separate products for different use cases to get meaningful comparisons.

4. A/B Test Models
Use sampling to compare models on the same traffic.
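One common approach is deterministic bucketing by user ID, sketched below. The 50/50 split and the two model choices are assumptions for the example; hashing keeps each user pinned to one arm so their conversations stay comparable:

```typescript
// Deterministically assign each user to one of two models.
function pickModel(userId: string): string {
  // Tiny stable hash (djb2 variant); fine for bucketing, not cryptography.
  let h = 5381;
  for (const ch of userId) h = ((h * 33) ^ ch.charCodeAt(0)) >>> 0;
  return h % 2 === 0 ? "openai/gpt-4" : "anthropic/claude-3-5-sonnet";
}
```

Send the chosen model string with every tracked message so Greenflash can compare quality and satisfaction across the two arms.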

