Week 2 of the AI Daily Brief New Year’s resolution (AIDBNY) sessions is about moving away from anecdotal vibes and toward a repeatable way to measure model performance. I’m starting to build a Model Map—a systematic comparison of how different LLMs handle specific task types.
This is a first pass. The goal isn’t a definitive ranking yet, but rather a baseline to see where different models succeed or fail when given the same set of constraints.
The Workflow: Emacs + gptel
My day-to-day work environment is Emacs, and I’ve been wanting to use gptel in more of my workflows. gptel integrates tightly with Emacs, which means I can bring an LLM into almost any place I can move the cursor.
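For anyone wanting to replicate the setup, a minimal gptel configuration looks something like this (a sketch: the backend, model name, and environment variable are examples, not my exact config):

```elisp
;; Minimal gptel setup (sketch; model and key source are placeholders)
(require 'gptel)

;; Register an Anthropic backend and make it the default.
;; gptel ships with an OpenAI backend out of the box.
(setq gptel-model 'claude-sonnet-4-20250514
      gptel-backend (gptel-make-anthropic "Claude"
                      :stream t
                      :key (getenv "ANTHROPIC_API_KEY")))
```

With this in place, `M-x gptel-send` works from any buffer, which is what makes running the same prompt against different models so low-friction.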
The Prompt Library
There is no shortage of prompts I could test, but for this initial evaluation I’ve settled on six prompts across three task types:
- Code: Test the mechanical aspects of software development, such as refactoring and debugging.
- Analysis: Test structural understanding by asking the model to produce insightful, syntactically valid diagrams.
- Creative: Test the model’s ability to synthesize unstructured notes into structured strategy or narrative.
P001 (Code: Refactoring)
Identifying dependencies and refactoring snippets into reusable, typed functions with proper documentation.
Refactor this code into a reusable function that accepts all dependencies as parameters: [CODE_SNIPPET]
Requirements:
* Identify all variables and functions used in the code block
* Create a function signature that accepts these as parameters
* Replace hard-coded values with function parameters
* Add TypeScript types if applicable
* Include JSDoc comments explaining the function's purpose and parameters
* Handle edge cases and provide default values where appropriate
* Preserve the original functionality exactly
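To make P001 concrete, here is the kind of transformation I’m looking for: a hypothetical snippet with hard-coded values and implicit dependencies, refactored into a documented function (the names and values here are illustrative, not from a real codebase):

```javascript
// Before (implicit dependency on a module-level taxRate):
//   const taxRate = 0.08;
//   const total = price * quantity * (1 + taxRate);

/**
 * Calculate an order total including tax.
 * @param {number} price - Unit price of the item.
 * @param {number} quantity - Number of units ordered.
 * @param {number} [taxRate=0.08] - Tax rate as a decimal fraction.
 * @returns {number} Total cost including tax.
 */
function calculateTotal(price, quantity, taxRate = 0.08) {
  // Edge case: reject negative inputs rather than returning nonsense.
  if (price < 0 || quantity < 0) {
    throw new RangeError("price and quantity must be non-negative");
  }
  return price * quantity * (1 + taxRate);
}

console.log(calculateTotal(10, 2, 0.25)); // 25
```

A response that only renames variables without surfacing the hidden `taxRate` dependency fails the prompt.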
P002 (Code: Debugging)
Finding bugs in JavaScript and providing a root-cause explanation alongside corrected code and test cases.
Debug this JavaScript function and explain the issue: [CODE_SNIPPET]
Requirements:
* Identify the bug(s) in the code
* Explain why the bug occurs
* Provide the corrected code
* Include test cases that demonstrate the fix
* Explain any potential edge cases
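As an example of what P002 tests, here is a hypothetical buggy function and the root-cause fix I’d expect a good response to include:

```javascript
// Buggy: the loop bound uses <=, so the final iteration reads
// items[items.length], which is undefined, and the sum becomes NaN.
function sumBuggy(items) {
  let sum = 0;
  for (let i = 0; i <= items.length; i++) {
    sum += items[i];
  }
  return sum;
}

// Fixed: iterate within bounds (reduce avoids manual indexing entirely).
function sum(items) {
  return items.reduce((acc, n) => acc + n, 0);
}

console.log(Number.isNaN(sumBuggy([1, 2, 3]))); // true
console.log(sum([1, 2, 3])); // 6
```

The bar here isn’t just the fix; it’s whether the model explains *why* `undefined` poisons the arithmetic and covers the empty-array edge case.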
P003 (Analysis: Call Flow)
Analyzing a codebase to generate Mermaid sequence diagrams for specific functions.
Generate a Mermaid sequence diagram for the call flow of [FUNCTION_NAME] using the provided code context.
Requirements:
* Accept multiple files as context (codebase)
* Locate the specified function [FUNCTION_NAME]
* Analyze the function's call graph and dependencies
* Generate a Mermaid sequence diagram showing:
* Function calls in chronological order
* Parameter passing between functions
* Return values and data flow
* Error handling paths
* Include all nested function calls and their relationships
* Use proper Mermaid syntax with sequenceDiagram tags
* Provide both the diagram code and a visual representation
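For reference, the shape of output I’m expecting from P003 looks like this (a hypothetical checkout flow, not taken from any real codebase):

```mermaid
sequenceDiagram
    participant C as checkout()
    participant V as validateCart()
    participant P as processPayment()
    C->>V: validateCart(cart)
    V-->>C: { valid: true }
    C->>P: processPayment(cart.total)
    alt payment fails
        P-->>C: throws PaymentError
    else payment succeeds
        P-->>C: receipt
    end
```

A model that gets the arrows, participants, and `alt` error path syntactically right has cleared the first hurdle; whether the flow matches the actual code is the harder test.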
P004 (Analysis: Architecture)
Using Structurizr DSL to map out system architecture at various C4 levels (Context, Container, Component, or Code).
Generate a C4 diagram using Structurizr syntax for the provided codebase at [DETAIL_LEVEL].
Requirements:
* Accept multiple files as context (codebase)
* Accept [DETAIL_LEVEL] parameter (C1, C2, C3, or C4)
* Generate appropriate C4 diagram based on level:
* C1: Context diagram (systems and users)
* C2: Container diagram (applications, databases, etc.)
* C3: Component diagram (internal components)
* C4: Code diagram (classes, interfaces, etc.)
* Use proper Structurizr DSL syntax
* Include all relevant elements and relationships
* Provide both the diagram code and a visual representation
* Explain the architectural decisions and patterns identified
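To illustrate the expected output format for P004, here is a minimal C2 (container-level) sketch in Structurizr DSL for an invented blog platform (all element names are hypothetical):

```structurizr
workspace {
    model {
        user = person "Reader"
        blog = softwareSystem "Blog Platform" {
            webapp = container "Web Application"
            db = container "Database"
        }
        user -> webapp "Reads posts via"
        webapp -> db "Stores content in"
    }
    views {
        container blog {
            include *
            autolayout lr
        }
    }
}
```

The interesting failure mode is models that know Mermaid well but hallucinate Structurizr keywords; the DSL is less represented in training data.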
P005 (Creative: Strategy Map)
Creating a directed graph of the content pieces, dependencies, and clusters needed to reach a specific topical-authority goal.
Generate a content strategy map and directed graph for achieving [CONTENT_DESTINATION] based on existing content.
Requirements:
* Analyze existing content inventory and identify gaps
* Define a clear [CONTENT_DESTINATION] goal (e.g., "Become a recognized expert in AI development")
* Create a directed graph showing:
* Required content pieces to reach destination
* Dependencies between content items
* Content hierarchy and relationships
* Content clusters and topical authority
* Generate a visual representation using Mermaid or Graphviz syntax
* Include content type recommendations (blog posts, videos, tutorials, etc.)
* Provide timeline and prioritization suggestions
* Identify key performance indicators for measuring progress
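The directed graph P005 asks for might look like this hypothetical example (content titles invented for illustration):

```mermaid
graph TD
    A[Intro: What is agentic AI?] --> B[Tutorial: LLM workflows in Emacs]
    A --> C[Survey: LLM evaluation methods]
    B --> D[Case study: Model Map results]
    C --> D
    D --> E((Destination: recognized voice on AI workflows))
```

Edges represent prerequisite relationships: a piece shouldn’t be published until the content it depends on exists.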
P006 (Creative: Contextual Drafting)
Synthesizing research files and style guidelines into a cohesive post with proper citations and metadata.
Write a blog post based on provided context notes, source documents, and style guidelines.
Requirements:
* Accept multiple context files as research material
* Accept [TARGET_LENGTH] parameter (word count)
* Accept [VOICE_STYLE_GUIDE] parameter (brand tone, vocabulary, formatting)
* Synthesize information from all sources into a cohesive narrative
* Follow the specified voice and style guidelines exactly
* Include proper citations and references where appropriate
* Generate a compelling headline and meta description
* Generate an excerpt of no more than 155 characters
* Provide both the full post and an outline structure
Code Tasks and Agents
While I’m including code refactoring and debugging in this initial test set, I suspect these manual prompts may eventually become redundant. As my workflow leans more toward agentic patterns, the need to explicitly prompt for a refactor starts to feel like an intermediate step.
If an agent is operating with full repository context, these individual tasks should ideally happen in the background. For now, however, testing the underlying logic of the models remains the necessary first step.