2025 AI Agents: Ensuring Safe Usage through Evaluation, Verification, and Monitoring Methods with Examples
- TecAce Software
- Jan 13
- 6 min read

1. Introduction
Recently, the use of AI Agents has been increasing across various industries. Moving beyond simple chatbots that answer questions, they are evolving into systems that can autonomously assess situations, use the necessary tools, and produce results. However, for these agents to operate as safely and accurately as expected, systematic evaluation, verification, and continuous monitoring are essential. This article introduces the evaluation, verification, and monitoring methods needed to operate AI Agents safely, along with specific examples of how they can be applied in real business environments.
2. Understanding AI Agents
2.1 Definition and Characteristics of AI Agents
AI Agents are software systems that receive input from the environment or users, make autonomous decisions, and achieve specific goals. Recently, combined with large language models (LLMs), they can understand context and automatically handle complex tasks by calling external tools (e.g., APIs, databases, internal systems).
Key characteristics:
Autonomy: Rather than only responding to user queries, they independently perform the additional actions needed to solve a problem.
Tool utilization: They call appropriate APIs, databases, and analysis tools as needed to gather required information.
Continuous learning and updating: They continuously improve performance by incorporating user feedback and new data.
2.2 General Operation Process
Receive user request
Understand context through LLM, etc.
Identify necessary tools (e.g., news search API, analysis API)
Call tools
Aggregate results and deliver to user
Collect feedback and retrain (optional)
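The loop below is a minimal Python sketch of this process. The toy_planner and news_search functions are stand-ins for a real LLM planning call and a real tool, introduced only for illustration; they are not part of any specific framework.

def run_agent(user_request: str, plan_fn, tools: dict) -> str:
    # 2-3. Understand the request and decide which tool to use
    plan = plan_fn(user_request)
    # 4. Look up and call the selected tool
    result = tools[plan["tool"]](**plan["args"])
    # 5. Aggregate the tool output into a user-facing answer
    return f"Summary for '{user_request}': {result}"

# Stand-ins for an LLM planner and a news search tool (hypothetical)
def toy_planner(request: str) -> dict:
    return {"tool": "news_search", "args": {"keyword": request}}

def news_search(keyword: str) -> str:
    return f"3 articles found about '{keyword}'"

print(run_agent("AI industry", toy_planner, {"news_search": news_search}))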
2.3 Use Cases
Customer support: Automated FAQ responses, problem-solving guide provision
Marketing and sales support: Market trend and competitor analysis, report generation
Data analysis: Search for articles or papers on specific topics, summarize and derive insights
Business decision support: Generate materials necessary for decision-making by synthesizing quantitative and qualitative data
3. Importance of AI Agent Evaluation, Verification, and Monitoring
3.1 Need for Evaluation and Verification
AI agents go through a much more complex decision-making process than conventional static models. If judgment errors or unnecessary tool calls accumulate, the result can be higher costs, information leakage, and lower work efficiency. Therefore, beyond simple accuracy evaluation, comprehensive evaluation and continuous monitoring of safety, efficiency, and security are necessary.
3.2 Key Considerations
Accuracy: Evaluate the accuracy of answers or analysis results provided by the agent. This is important to prevent providing incorrect information.
Processing speed and resource usage: Monitor system response time and resource usage (CPU, GPU, memory) to evaluate efficiency.
Cost and token usage: Analyze the costs and token usage of AI models in API calls and result generation processes to evaluate economic efficiency.
Security and permission management: Evaluate whether the agent can be prevented from calling tools inappropriately, so that sensitive information is protected and system integrity is maintained.
Tool usage efficiency: Evaluate how effectively the agent utilizes external tools or APIs. Measure through API call success rate, duplicate call rate, and value of results compared to call cost (ROI).
4. Evaluation and Verification Workflow and Metric Design
4.1 Setting Goals and Metrics
Define goals for each AI agent purpose
Example: "Maintain news search result accuracy above 90%"
Example: "Achieve user satisfaction (NPS) of 8 or higher for analysis and summary results"
Set measurable metrics (quantitative and qualitative indicators)
Quantitative metrics: response accuracy, API call success rate, response time
Qualitative metrics: user satisfaction (CSAT, NPS), qualitative feedback from interviews
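One simple way to make these goals machine-checkable is to keep the targets in code and compare each measured value against them. The metric names and thresholds below are illustrative, taken directly from the example goals above.

# Targets taken from the example goals above (illustrative values)
targets = {
    "news_search_accuracy": 0.90,   # "accuracy above 90%"
    "user_satisfaction_nps": 8.0,   # "NPS of 8 or higher"
}

def check_targets(measured: dict) -> dict:
    """Return True/False per metric depending on whether its target is met."""
    return {name: measured.get(name, 0) >= threshold
            for name, threshold in targets.items()}

print(check_targets({"news_search_accuracy": 0.93, "user_satisfaction_nps": 7.6}))
# {'news_search_accuracy': True, 'user_satisfaction_nps': False}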
4.2 Experimentation and Data Collection
Beta testing: Pilot operation targeting specific departments or customer groups
Log and feedback collection:
Tool call logs (success/failure, response time)
User feedback (NPS, CSAT, surveys)
Error monitoring:
API call failure rate
Model inference error occurrence frequency
Security issues (permission misuse, sensitive data exposure, etc.)
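As a sketch of how the collected tool-call logs can feed these error metrics, the snippet below assumes each log entry is a dictionary with response_status and latency_ms fields, similar to the log schema shown in section 5.3.

from statistics import mean

# Example tool-call log entries (structure assumed; see section 5.3)
logs = [
    {"response_status": 200, "latency_ms": 350},
    {"response_status": 500, "latency_ms": 1200},
    {"response_status": 200, "latency_ms": 410},
]

failure_rate = sum(1 for e in logs if e["response_status"] >= 400) / len(logs)
avg_latency = mean(e["latency_ms"] for e in logs)
print(f"API failure rate: {failure_rate:.1%}, average latency: {avg_latency:.0f} ms")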
4.3 Evaluation and Improvement
LLM performance analysis: Context understanding, answer validity, summary quality, etc.
Tool call efficiency analysis: Unnecessary duplicate calls, failure rate, cache utilization
Problem classification and improvement:
Identify causes (prompt issues, dataset bias, model structure limitations, etc.)
Modify, update, and retest
4.4 Continuous Feedback Loop
Model and agent updates and re-evaluation:
Re-check performance of improved models, re-measure user satisfaction
Apply DevOps/MLOps:
Build CI/CD pipeline, automated testing, gradual deployment
Compare and verify various versions through A/B testing
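One hedged example of the automated-testing step is a regression test that fails the CI pipeline when an evaluation metric drops below its target. The evaluate_accuracy helper and the 0.90 threshold are assumptions for illustration, not part of any specific CI tool.

# test_agent_quality.py -- run (e.g. with pytest) before promoting a new agent version
def evaluate_accuracy(agent_version: str) -> float:
    # Placeholder: in practice, replay a labeled test set through the agent
    # and score its answers; here a fixed value is returned for illustration.
    return 0.93

def test_accuracy_meets_target():
    assert evaluate_accuracy("candidate") >= 0.90, "Accuracy regression: block deployment"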
5. Real-world Example: News Search and Analysis AI Agent
5.1 Scenario Overview: Company A Case
Company A wants to collect industry news weekly and distribute it to employees in report form. The AI Agent automatically searches, analyzes, and summarizes news when given specific keywords, returning results in document form.
5.2 Implementation and Evaluation Metric Examples
Tool usage efficiency:
API call success rate: 950 successful calls / 1,000 total calls → 95%
Duplicate call rate: 50 repeated calls for same keyword / 1,000 total calls → 5%
Cache utilization rate: 300 cache reuse calls / 1,000 total calls → 30%
Analysis result utilization metrics:
Decision reflection rate: 25 out of 40 generated reports used in actual internal meetings → 62.5%
User satisfaction: Average survey score 8.2/10 (based on feedback from 100 people)
Cost and token usage:
API call cost: Monthly total API call cost $50 (based on 1,000 calls)
Token usage: Average 1,500 tokens used per call. Total 1,500,000 tokens consumed → Approximately $30 cost based on OpenAI pricing
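The figures above can be reproduced with simple arithmetic. The token price used below ($0.02 per 1,000 tokens) is an assumption chosen so the total matches the roughly $30 mentioned above; it is not a statement of current OpenAI pricing.

total_calls = 1_000
successful_calls, duplicate_calls, cached_calls = 950, 50, 300
reports_used, reports_total = 25, 40
tokens_per_call, price_per_1k_tokens = 1_500, 0.02   # price assumed for illustration

print(f"success rate:             {successful_calls / total_calls:.0%}")    # 95%
print(f"duplicate call rate:      {duplicate_calls / total_calls:.0%}")     # 5%
print(f"cache utilization rate:   {cached_calls / total_calls:.0%}")        # 30%
print(f"decision reflection rate: {reports_used / reports_total:.1%}")      # 62.5%
total_tokens = tokens_per_call * total_calls                                 # 1,500,000
print(f"estimated token cost:     ${total_tokens / 1_000 * price_per_1k_tokens:.0f}")  # ~$30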
5.3 Logging & Observability Example
Log schema example:
{
  "timestamp": "2025-01-07T10:00:00Z",
  "tool_name": "NewsSearchAPI",
  "request_params": {"keyword": "AI industry", "dateRange": "1week"},
  "response_status": 200,
  "latency_ms": 350,
  "is_cached": false,
  "result_count": 150,
  "api_cost": 0.005,
  "tool_selection_reason": "Keyword requires real-time news updates",
  "alternative_tools_considered": ["CachedNewsAPI", "HistoricalDataAPI"],
  "selection_score": 0.92,
  "user_feedback_score": 8,
  "efficiency_metrics": {
    "relevant_results_ratio": 0.85,
    "time_saved_sec": 12,
    "cost_efficiency": 0.97
  }
}
Monitoring tools:
Visualize API response time, error rate, duplicate call rate, etc., using Prometheus and Grafana
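As a sketch of how such metrics can be exposed for Prometheus to scrape from the agent's tool-calling code, the snippet below uses the prometheus_client Python library; the metric names and the call_tool wrapper are illustrative choices, not a prescribed schema.

import time
from prometheus_client import Counter, Histogram, start_http_server

TOOL_CALLS = Counter("agent_tool_calls_total", "Tool calls by tool and status", ["tool", "status"])
TOOL_LATENCY = Histogram("agent_tool_latency_seconds", "Tool call latency", ["tool"])

def call_tool(tool_name, fn, *args, **kwargs):
    """Wrap a tool call so its success/failure count and latency are recorded."""
    start = time.time()
    try:
        result = fn(*args, **kwargs)
        TOOL_CALLS.labels(tool=tool_name, status="success").inc()
        return result
    except Exception:
        TOOL_CALLS.labels(tool=tool_name, status="failure").inc()
        raise
    finally:
        TOOL_LATENCY.labels(tool=tool_name).observe(time.time() - start)

start_http_server(8000)   # Prometheus scrapes http://localhost:8000/metrics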
5.4 Improvement and Operation Process Summary
Beta operation → First deploy to limited users
Log analysis → Identify issues such as duplicate calls, unnecessary API costs, token usage
Introduce caching policy → Add logic to minimize re-requests, optimizing API call costs and token consumption (a caching sketch follows this list)
Measure KPIs after redeployment → Confirm response speed, satisfaction, cost reduction effect, token usage reduction effect
Cost efficiency analysis → Optimize API selection and policies based on cost and token consumption per call
Continuous monitoring → Optimize cost and resource usage through repeated improvements
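A minimal sketch of the caching step above, assuming a simple in-process dictionary cache with a time-to-live; a production setup would more likely use a shared cache such as Redis.

import time

_cache: dict = {}   # key -> (stored_at, result)

def cached_call(fn, *args, ttl_sec: int = 3600):
    """Reuse a recent result for an identical request instead of re-calling the API."""
    key = (fn.__name__, args)
    now = time.time()
    if key in _cache and now - _cache[key][0] < ttl_sec:
        return _cache[key][1]        # cache hit: no API cost, no tokens
    result = fn(*args)               # cache miss: pay for one real call
    _cache[key] = (now, result)
    return result

def news_search(keyword: str) -> str:   # stand-in for the real news search API
    return f"latest articles about '{keyword}'"

print(cached_call(news_search, "AI industry"))   # real call
print(cached_call(news_search, "AI industry"))   # served from the cache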
6. Tips for Safe and Efficient AI Agent Operation
6.1 Human-in-the-loop design:
Design the decision-making process so that high-risk tasks are always confirmed or approved by a human.
Example: In the medical field, diagnostic results recommended by AI must be reviewed and confirmed by specialists
To implement this, visualize the agent's decision-making flow in a human-understandable form and add a "pending approval" status for important decisions
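A minimal sketch of such a "pending approval" gate, assuming a hypothetical risk label on each decision and a simple in-memory review queue:

from dataclasses import dataclass

@dataclass
class Decision:
    description: str
    risk: str                    # "low" or "high" (assumed labeling)
    status: str = "auto_approved"

pending_review = []              # decisions waiting for a human reviewer

def submit(decision: Decision) -> Decision:
    """High-risk decisions wait for human approval; low-risk ones proceed automatically."""
    if decision.risk == "high":
        decision.status = "pending_approval"
        pending_review.append(decision)
    return decision

print(submit(Decision("Send weekly news report", risk="low")).status)            # auto_approved
print(submit(Decision("Recommend diagnosis to a patient", risk="high")).status)  # pending_approval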
6.2 Ethical issue and bias monitoring:
Regularly review the diversity and representativeness of training datasets to prevent data bias. For example, data that is excessively skewed toward specific regions or population groups should be excluded or rebalanced.
Real-time monitoring: Implement a system that automatically sends warnings or blocks responses when discriminatory or harmful expressions are found in AI-generated responses (a minimal sketch follows this list).
Case: Regularly test whether AI recruitment agents make recommendations biased towards specific genders or races
Refine and apply ethical principles based on guidelines from organizations such as the Partnership on AI
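The real-time monitoring item above could start as simply as the keyword-based screen below; the blocklist terms and the send_alert hook are placeholders, and a real deployment would typically use a trained classifier rather than string matching.

BLOCKLIST = {"harmful phrase", "discriminatory phrase"}   # placeholder terms

def send_alert(message: str) -> None:
    print("[ALERT]", message)     # stand-in for paging / dashboard alerting

def screen_response(text: str) -> str:
    """Withhold the response and raise an alert if it contains flagged expressions."""
    hits = [term for term in BLOCKLIST if term in text.lower()]
    if hits:
        send_alert(f"Blocked response containing: {hits}")
        return "This response was withheld pending human review."
    return text

print(screen_response("Here is a harmful phrase in the generated output."))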
6.3 Evaluation and updates optimized for agent purpose:
Refine evaluation through regular user feedback loops:
Evaluate AI performance and user satisfaction through periodic surveys (NPS) and user log analysis
Example: Monthly sampling of 10% of requests processed by AI agents to check accuracy
Update new data and evaluation items using MLOps tools:
Regularly add new data through automated data pipelines to update AI models
Example: Set up an automatic model retraining workflow that triggers when major KPIs (performance indicators) drop by more than 10%, and update improvement items based on user feedback
A/B testing for objective evaluation and comparison:
Apply new models to small user groups first, compare performance with existing models to confirm stability before full deployment
Make the A/B test evaluation logic objective and expand the evaluation parameters (test cases) to achieve consistent improvements in evaluation quality
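A hedged sketch of this A/B step: route a small share of users to the candidate model, then promote it only if it beats the baseline on the chosen metric. The 10% traffic share and 0.02 improvement margin are illustrative values, not recommendations.

import hashlib

def assign_variant(user_id: str, candidate_share: float = 0.10) -> str:
    """Stably route roughly candidate_share of users to the candidate model."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "candidate" if bucket < candidate_share * 100 else "baseline"

def promote_candidate(baseline_score: float, candidate_score: float,
                      margin: float = 0.02) -> bool:
    """Promote only if the candidate beats the baseline by a clear margin."""
    return candidate_score >= baseline_score + margin

print(assign_variant("user-123"))                                      # "candidate" or "baseline"
print(promote_candidate(baseline_score=0.88, candidate_score=0.91))    # True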
7. Conclusion
AI Agents have now acquired the potential to automate complex decision-making beyond chatbots. To safely and efficiently utilize this potential, clear goals and metrics must be set, and continuous improvement must be made through systematic logging and monitoring.
Key summary:
Introduce evaluation and verification processes to ensure stability and reliability
Perform continuous monitoring and updates through DevOps/MLOps pipelines
Sophisticate evaluation functions optimized for purpose and regulations
Future outlook: AI Agents are expected to spread across more and more industries. Accordingly, collaboration on infrastructure, security, and ethics will become more important than ever.
TecAce's AI Supervision is an AI evaluation and quality supervision solution designed to enhance the reliability and stability of enterprise AI Agents. It offers capabilities such as Human-in-the-Loop workflows, integration with ML/LLM Ops, Evaluation Metric Studio, and high-quality test case generation to expand evaluation parameters, as introduced earlier.
8. References
Partnership on AI: Guidelines for solving AI ethics and bias issues
Prometheus: API monitoring and performance analysis tool
Grafana: Data visualization and dashboard configuration tool
OpenAI Function Calling: OpenAI-based function calling guide
Hugging Face Transformers: NLP and LLM model library
MLOps Community GitHub: Introduction to MLOps-related cases and tools
AI ethics case analysis: AI ethical issues and solutions
How AgentOps Helps Developers Build and Monitor Reliable AI Agents
Lead AI governance with AI Supervision! AI Supervision ensures transparency and ethical responsibility of AI systems, supporting businesses in establishing reliable AI governance. Create a safer and more trustworthy AI environment with AI Supervision, which offers real-time monitoring, performance evaluation, and compliance with ethical standards!