Tech Bridge Log

Building a Self-Improving Advisor System with GitHub Copilot and Knowledge Base


Built a self-improving advisor system using GitHub Copilot's agent functionality that evaluates and improves a knowledge base. Multiple agents collaborate to automatically enhance response quality.

Motivation

Recently, while exploring GitHub Copilot's agent functionality, I had a thought: "Do agents consistently deliver the same quality of responses when answering questions?"

Thinking about it, human advisors improve their responses as they gain experience, right? I wondered if we could replicate that with AI.

The hypothesis was simple: if we give an agent a knowledge base and create a loop to evaluate and improve it, the quality of responses should increase over time.

So I built a system to test this hypothesis.

Overall System Architecture

I created a system where multiple agents work together. It consists of four agents (Main, Sub-A, Sub-B, Sub-C), each with a distinct role.

(Diagram: System overview)

Here's a rough breakdown of the diagram above:

  1. Main Agent: The orchestrator that manages trial counts and runs improvement-evaluation cycles
  2. Sub-A Agent: The tester role that evaluates responses using benchmark cases
  3. Sub-B Agent: The actual advisor that generates advice by referencing the knowledge base
  4. Sub-C Agent: The improver that updates the knowledge base based on evaluation results

These four agents collaborate to run up to five trials automatically.

Implementation: Preparing Agent Files

GitHub Copilot's agent functionality works by placing Markdown files in the .github/agents/ directory. You define the name and available tools in the frontmatter, and describe the specific behaviors in the body.

Main Agent Definition

The main agent handles overall orchestration.

10-main.agent.md
````markdown
---
name: Main Agent
description: This custom agent manages the overall improvement loop by coordinating Sub-A and Sub-C agents.
tools: ['vscode', 'execute', 'read', 'agent', 'edit', 'search', 'web', 'todo']
---

# Main Agent

**Role**: Overall management of the improvement loop (orchestrator)

**Behaviors**:
- **Trial loop management**:
  - Check `docs/scores.yml` at startup.
  - If records exist, set the current trial number to the last trial number + 1.
  - If no records exist, set the current trial number to 1.
- **Improvement execution instructions**:
  - If the current trial number is **2 or later**, instruct the Sub-C agent to make improvements based on the previous review.
  - After receiving the improvement completion report from Sub-C, proceed to the next step.
- **Evaluation execution instructions**:
  - Instruct the Sub-A agent to start the evaluation process, communicating the current "trial number".
- **Termination condition judgment**:
  - Terminate processing if any of the following conditions (OR) are met:
    1. The trial number has reached 5.
    2. The evaluation score (average points) has declined for two consecutive trials.
- **Actions after evaluation completion**:
  - After receiving the completion report from Sub-A, check `docs/scores.yml`.
  - **Rollback on score decline**:
    - If the current score is lower than the previous score, determine that changes by Sub-C had a negative impact and **discard only changes to the `knowledge/` directory and `config.yml`** (retain records in the `docs/` directory).
  - **Commit on score improvement or maintenance**:
    - If the current score is higher than or equal to the previous score, determine that changes by Sub-C were effective and commit the changes.
    - Example commit message: `Trial <trial number>: Improved knowledge base (Score: <current score>)`
  - **Transition to next cycle**:
    - If termination conditions are not met, increment the trial number and start the next loop.
````

The key feature is the rollback functionality when scores decline. It reverts only the knowledge/ directory and config.yml while keeping evaluation records. This prevents degradation while tracking what changes didn't work.

Sub-A Agent: Evaluation Role

20-sub-a.agent.md
````markdown
---
name: Sub-A Agent
description: This custom agent conducts benchmark tests and evaluations using multiple Sub-B agents.
tools: ['vscode', 'execute', 'read', 'agent', 'edit', 'search', 'web', 'todo']
---

# Sub-A Agent

**Role**: Benchmark testing and evaluation implementation (tester)

**Behaviors**:
- **Test preparation**:
  - Create a directory `docs/trials/<trial number>/` for the current trial.
  - Read `docs/advice-benchmark-cases.md`.
  - Extract only the `Input` (consultation content) from each test case.
- **Answer generation instructions**:
  - Launch 3 Sub-B agents (instances) to prevent bias in responses.
  - Instruct each Sub-B agent to create answers by passing "Input", "current trial number", and "instance number (1-3)".
- **Evaluation and scoring**:
  - Compare and evaluate the `Output` (answers) obtained from the 3 Sub-B agents against the benchmark's `Expected` (expected response approach).
  - Score on a scale of 0-10 points (to one decimal place).
- **Result recording**:
  - **Score**: Append to `docs/scores.yml`.
    - Use the following YAML array format:
      ```yaml
      - trial: <trial number>
        details:
          sub_b_1: <score>
          sub_b_2: <score>
          sub_b_3: <score>
        average: <average score>
      ```
  - **Review**: Record in `docs/trials/<trial number>/review.md`.
    - Describe specific issues and missing perspectives in responses to serve as hints for knowledge improvement.
- **Completion report**:
  - Report to the Main agent upon completion of evaluation and recording.
````

I launch Sub-B three times to observe response variance. Even with the same knowledge base, each generation produces slightly different responses, so averaging them stabilizes the evaluation.
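Sub-A's review.md is what Sub-C later works from. I won't reproduce a real one here, but an entry might look roughly like this (the case name, scores, and wording are illustrative):

```markdown
# Trial 2 Review

## Case 3: Loss of confidence at work
- Scores: sub_b_1 9.5 / sub_b_2 9.0 / sub_b_3 8.5
- Issue: all three answers jump straight to advice before acknowledging the consulter's feelings (lacks empathy).
- Missing perspective: none of the answers separate subjective interpretation from facts.

## Hints for knowledge improvement
- Add a section on emotional acceptance with concrete opening phrases.
- Add "confidence" and "self-doubt" to the relevant keywords in config.yml so the right knowledge file gets selected.
```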

Sub-B Agent: The Advisor Core

30-sub-b.agent.md
````markdown
---
name: Sub-B Agent
description: This custom agent generates answers based on the knowledge base using a RAG-like approach.
tools: ['vscode', 'execute', 'read', 'agent', 'edit', 'search', 'web', 'todo']
---

# Sub-B Agent

**Role**: Knowledge-based answer generation (advisor)

**Behaviors**:
- **Knowledge selection (RAG-like approach)**:
  - Analyze the `Input` (consultation content) and extract important keywords.
  - Read `config.yml` and narrow down the knowledge files (`path`) to be referenced based on extracted keywords and categories.
  - Read only the selected knowledge files to serve as the basis for the answer (to prevent context overflow).
- **Answer consideration**:
  - Create answers to the `Input` received from Sub-A.
  - Strictly reference the content of selected knowledge and derive answers aligned with its perspectives and approach.
- **Answer output**:
  - Save the created answer to `docs/trials/<trial number>/sub-b-<instance number>-output.md`.
- **Completion report**:
  - Report to the Sub-A agent upon completion of output.
````

This is the RAG-like part. It extracts keywords from input, checks config.yml, and loads only the necessary knowledge files. This prevents context overflow.

Initially, I had it load all knowledge, but token counts exploded immediately. So a mechanism to narrow down to the minimum necessary became essential.
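Sub-B's selection step depends entirely on config.yml, so here is a minimal sketch of what it might contain. The entries (and the second file) are illustrative rather than the actual contents:

```yaml
# Illustrative sketch of config.yml; the real structure may differ
knowledge:
  - path: knowledge/general.md
    category: general
    description: "Basic counseling stance: empathy first, then concrete next steps"
    keywords: [confidence, anxiety, empathy, listening]
  - path: knowledge/career.md      # hypothetical second file, for illustration only
    category: work
    description: "Career change dilemmas and relationships with supervisors"
    keywords: [career change, supervisor, workload, evaluation]
```

Sub-B matches the keywords it extracted from the Input against these entries and reads only the files that match.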

Sub-C Agent: Improvement Role

40-sub-c.agent.md
````markdown
---
name: Sub-C Agent
description: This custom agent improves the knowledge base based on evaluation feedback.
tools: ['vscode', 'execute', 'read', 'agent', 'edit', 'search', 'web', 'todo']
---

# Sub-C Agent

**Role**: Knowledge base improvement (trainer/engineer)

**Behaviors**:
- **Improvement implementation**:
  - Based on instructions from the Main agent, modify and improve the structure and file contents in `config.yml` and the `knowledge/` directory.
- **Improvement recording**:
  - When improvements are made, describe them in `docs/trials/<trial number>/improvements.md`.
  - Create the directory `docs/trials/<trial number>/` if it does not exist.
- **Improvement basis**:
  - Read `docs/trials/<trial number>/review.md` of the latest trial (the one with the highest trial number).
  - Update knowledge to resolve the identified issues (add perspectives, make concrete, organize structure, etc.).
- **Maintainability of searchability**:
  - When adding or modifying knowledge files, also update `keywords` and `description` in `config.yml` appropriately so that the Sub-B agent can correctly reference them.
- **Completion report**:
  - Report to the Main agent upon completion of improvement work.
````

Sub-C reads the previous review.md and rewrites the knowledge to address identified issues. The key is updating config.yml keywords and descriptions along with the knowledge. This ensures Sub-B can properly find the needed knowledge.
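As a hypothetical example of that last point: if Sub-C adds a "separating facts from interpretation" section to a knowledge file, the matching config.yml entry needs keywords for it, otherwise Sub-B's keyword matching will never surface the new content:

```yaml
# Illustrative only: a config.yml entry before and after a Sub-C improvement

# Before
- path: knowledge/general.md
  description: "Basic counseling stance"
  keywords: [confidence, anxiety, empathy]

# After: keywords added so Sub-B can find the new "facts vs. interpretation" section
- path: knowledge/general.md
  description: "Basic counseling stance, including separating facts from interpretation"
  keywords: [confidence, anxiety, empathy, facts, interpretation, reframing]
```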

Designing Benchmark Cases

For evaluation, I prepared 10 test cases covering common life consultations like "loss of confidence at work," "relationships with supervisors," and "career change dilemmas."

Each case includes:

  • Input: The consultation content
  • Expected: The expected response approach (e.g., "emotional acceptance," "separating subjective interpretation from facts")

During evaluation, Sub-A compares each response against the Expected approach and scores it on a 0-10 scale. It also writes reviews noting issues like "lacks empathy" or "weak logical proposals," which feed into the next improvement.
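I won't paste the whole benchmark file, but a single case in docs/advice-benchmark-cases.md follows this shape (the wording here is illustrative):

```markdown
## Case 1: Loss of confidence at work

**Input**: I keep making small mistakes at work, and I'm starting to feel I'm just not cut out for this job. How can I get my confidence back?

**Expected**: Start with emotional acceptance, separate the subjective interpretation ("not cut out for this") from the facts (a few recent mistakes), then suggest one small, concrete next step.
```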

Initial Prompt and Setup

Before running the system, I also created an initial setup prompt.

init.prompt.md
````markdown
---
name: init-advisor-system
description: Performs initial setup of the Advisor Evaluation System (generates agent definition files).
tools: ['vscode', 'execute', 'read', 'edit', 'search', 'web', 'skillport/*', 'terminal-runner/*', 'agent', 'todo']
---

This repository is an environment for cultivating and verifying an AI that acts as an "advisor for people's concerns". Rather than a simple consultation AI, the goal is to improve the quality and consistency of responses by running the following cycle:

1. **Knowledge creation from perspectives**: Accumulate and structure knowledge that serves as the basis for responses
2. **Verification**: Evaluate response consistency and approach using benchmark tests
3. **Improvement**: Like supervised learning loops, improve knowledge based on evaluation results

To achieve this goal, please create the following 4 AI agent definition files and necessary directory structure.

## Agent List to Create

1. **Main Agent** (`.github/agents/10-main.agent.md`)
2. **Sub-A Agent** (`.github/agents/20-sub-a.agent.md`)
3. **Sub-B Agent** (`.github/agents/30-sub-b.agent.md`)
4. **Sub-C Agent** (`.github/agents/40-sub-c.agent.md`)

(Detailed specifications for each agent follow...)
````

Using this prompt, you can generate all the necessary agent files and directory structure at once. It was convenient since manually writing everything would have been tedious.

Actual Execution Results

I ran five trials and recorded the score progression.

```yaml
- trial: 1
  details:
    sub_b_1: 9.9
    sub_b_2: 9.8
    sub_b_3: 8.6
  average: 9.4

- trial: 2
  details:
    sub_b_1: 9.7
    sub_b_2: 9.4
    sub_b_3: 9.6
  average: 9.6

- trial: 3
  details:
    sub_b_1: 9.8
    sub_b_2: 9.7
    sub_b_3: 9.9
  average: 9.8

- trial: 4
  details:
    sub_b_1: 9.8
    sub_b_2: 9.6
    sub_b_3: 9.9
  average: 9.8

- trial: 5
  details:
    sub_b_1: 8.5
    sub_b_2: 9.2
    sub_b_3: 9.5
  average: 9.1
```

Starting at 9.4 points, it rose to 9.8 points in Trials 3 and 4, then dropped to 9.1 points in Trial 5.

What Happened

Trials 2 and 3 showed steady score improvement (9.4 → 9.6 → 9.8), and Trial 4 held at 9.8. Sub-C read the review.md and made improvements like "add more concrete examples" or "clarify the framework," which raised the evaluation.

However, in Trial 5, Sub-C got ambitious and made major structural changes (probably). This backfired, dropping the score.

Without the rollback functionality, this degraded knowledge would have remained. The Main Agent automatically discarded the changes, reverting to Trial 4's state.

Findings and Reflections

Positive Points

  1. Scores actually improve: The rise from 9.4 to 9.8 shows the improvement loop works
  2. Rollback functionality is crucial: Without a safety net against degradation, things could keep getting worse
  3. Multiple instances for testing are effective: Running Sub-B three times averaged out generation variance

Challenges

  1. Limited improvement range: The 9.4 → 9.8 improvement is real but not dramatic. The initial knowledge was already decent, and the scores started near the top of the 0-10 scale, possibly leaving little room for growth
  2. Risk of major changes: Like Trial 5, agents can try too hard and cause degradation. Guidelines for "accumulating small improvements" might be needed
  3. Evaluation validity: If Sub-A's evaluation criteria aren't clear, score reliability decreases. This time it was just comparison with Expected, so more detailed evaluation axes should be prepared

Things Not Yet Tried

  • Increasing benchmark cases (10 seems few)
  • Splitting knowledge files and managing by category
  • Explicitly constraining Sub-C's improvement strategy (e.g., "maximum 3 changes per trial")

I'd like to try these when I have time.

Summary

Using GitHub Copilot's agent functionality, I built a system that generates answers from a knowledge base, evaluates them, and improves the knowledge base based on the results.

The results:

  • The improvement loop works (9.4 → 9.8 improvement confirmed)
  • Rollback functionality is important (demonstrated in Trial 5)
  • Complex processes can be automated through agent collaboration

There are still rough edges, but the direction of "agents learning on their own" is interesting. It seems especially applicable to team development or documentation management, where the system could accumulate knowledge and keep it updated automatically.

If you're interested, you can use the agent files from this article directly. Please give it a try.

Reference Repository

The code I created has the following structure:

```
.github/agents/
  ├── 10-main.agent.md
  ├── 20-sub-a.agent.md
  ├── 30-sub-b.agent.md
  └── 40-sub-c.agent.md
.github/prompts/
  └── init.prompt.md
knowledge/
  └── general.md
docs/
  ├── advice-benchmark-cases.md
  ├── scores.yml
  └── trials/
```

Refer to the agent and prompt definitions above for details on each file.
