IRAI Labs × Digital Rain Technologies

Agentic Readiness Assessment — Evaluation Methodology

LLM-Assisted Evaluation

A proposed scoring methodology designed for repeatability and auditability, combining regulatory framework synthesis with LLM-based evaluation and human calibration.

Traditional benchmarks don't measure agentic readiness. MMLU won't tell you whether an AI agent should autonomously block a $2M wire transfer at 3 AM. We need domain-specific evals with jurisdiction-aware rubrics.

Prior Art

LLM-as-a-Judge

Formalized by Zheng et al. (LMSYS, 2023): strong LLMs achieve >80% agreement with human preferences — on par with human-human agreement.

Vibe-Coded Evals

Rapid eval framework development using LLM orchestration. Domain-specific criteria iterated quickly through LLM-assisted rubric design. Tested across 20+ models. Augustin Chan, 2025.

ConFIRM

Personality-informed synthetic data generation. Different stakeholder archetypes produce different readiness thresholds. Gazeley et al., 2023.

5-Stage Pipeline

The proposed pipeline would produce scored readiness assessments for each operational domain. Each score includes reasoning traces and jurisdiction-specific citations.

01

Ingest Governance Frameworks

Extract governance criteria from international and HK-specific sources: NIST AI RMF, EU AI Act risk tiers, Singapore Model AI Governance, HKMA BDAI/GenAI circulars, SFC Circular 24EC55, PCPD AI Framework, PIPL cross-border rules, and CAC algorithm registration requirements.

Output

Structured criteria library — tagged by jurisdiction, risk category, and applicable financial service domain.

02

Synthesize HK-Specific Rubric

Translate regulatory texts into structured evaluation criteria mapped onto the 7-dimension framework (Decision Reversibility, Failure Blast Radius, Regulatory Exposure, Human Override Latency, Data Confidence, Accountability Chain, Graceful Degradation). The rubric translates regulation into scoreable criteria — the LLM judge applies the rubric but does not interpret primary regulation. Each dimension uses a four-level classification (A–D) with narrative anchors and concrete examples rather than numerical scores.

Output

Scoring rubric with dimension-specific criteria, four-level classification scales (A–D) with narrative anchors, gating rules, and jurisdiction-aware constraints.
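The rubric described above could be codified as a versioned data structure so the same criteria can be re-run and diffed over time. The sketch below is illustrative only — the field names, the example entry, and the lookup helper are assumptions, not the actual rubric schema.

```python
# Minimal sketch of a codified, versioned rubric. All field names and the
# sample entry are illustrative assumptions based on the methodology text.

RUBRIC = {
    "version": "2026-03",
    "dimensions": {
        "Decision Reversibility": {
            "levels": {
                "A": "Irreversible: action cannot be undone or correction is extremely costly",
                "B": "Hard to reverse: correction requires manual intervention",
                "C": "Easily reversible: correction is straightforward and contained",
                "D": "Fully reversible / sandboxed: automatic rollback, no user impact",
            },
            "examples": {"A": "irreversible trade execution"},
        },
    },
    "jurisdiction_constraints": ["HKMA", "SFC", "PCPD", "PIPL"],
}

def anchor(dimension: str, level: str) -> str:
    """Look up the narrative anchor for a dimension/level pair."""
    return RUBRIC["dimensions"][dimension]["levels"][level]
```

Versioning the rubric as data is what makes the "re-runnable" property concrete: a regulation change becomes a new rubric version, and old assessments stay reproducible against the version that produced them.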

03

Domain Scenario Modeling

Students work from a pre-built starter scenario library organized by risk tier (low / medium / high) and financial services domain. For each partner domain (e.g., HSBC fraud detection, HKEx market surveillance, AIA claims processing), students adapt and extend starter scenarios to the partner's specific context — adding stakes, regulatory constraints, and cross-border data implications. This ensures realistic, grounded scenarios rather than unconstrained hypotheticals.

Output

Scenario bank — 10–20 scenarios per domain, adapted from the starter library, ranging from low-risk (autonomous customer inquiry routing) to high-risk (autonomous transaction blocking on a mainland-linked account).

04

LLM-Assisted Evaluation

Submit each scenario + rubric to a strong evaluation model (e.g., Claude Opus). The LLM classifies each dimension at a level (A–D) with structured reasoning — it acts as a scoring assistant applying the pre-validated rubric, not as a legal analyst. Personality-aware variants apply ConFIRM methodology — the same scenario is evaluated from the perspective of a risk-averse compliance officer, an aggressive CRO, and a neutral operations director.

Output

Risk fingerprint per scenario — per-dimension level classifications (A–D), reasoning traces, gating rule outcomes, and personality-variant deltas.
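One way to keep the judge auditable is to require a structured JSON contract: one entry per dimension, each with a level and a reasoning trace, validated before it enters the scenario bank. The schema and field names below are assumptions for illustration, not a specified output format.

```python
# Sketch of a structured-output contract for the LLM judge. The JSON
# shape ("dimensions" -> {level, reasoning}) is an assumed convention.
import json

def parse_judge_output(raw: str) -> dict:
    """Validate the judge's JSON: every dimension must carry a level A-D."""
    result = json.loads(raw)
    for dim, entry in result["dimensions"].items():
        if entry["level"] not in ("A", "B", "C", "D"):
            raise ValueError(f"{dim}: invalid level {entry['level']!r}")
    return result

# Example response for a single dimension (illustrative content only).
raw = '''{"dimensions": {
    "Regulatory Exposure": {"level": "A",
        "reasoning": "HKMA AML/CFT circular applies; SFC oversight possible"}}}'''
fingerprint = parse_judge_output(raw)
```

Rejecting anything outside A–D at parse time keeps the judge inside its scoring-assistant role — free-text verdicts never reach the fingerprint.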

05

Human Calibration

Students review judge outputs against their domain research and stakeholder interviews. Identify where the LLM judge diverges from expert intuition. Calibrate rubric weights and scenario definitions based on discrepancies.

Output

Calibrated framework — validated by both LLM consistency and human domain expertise. Discrepancy patterns between LLM and human scores are documented for analysis.
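The discrepancy documentation in this stage can be as simple as a per-dimension diff between LLM and expert classifications. A minimal sketch, with illustrative data values:

```python
# Sketch of the Stage 05 discrepancy check: compare LLM levels against
# human expert levels dimension by dimension. Values are illustrative.

def discrepancies(llm: dict, human: dict) -> list:
    """Return (dimension, llm_level, human_level) wherever they diverge."""
    return [(d, llm[d], human[d]) for d in llm if llm[d] != human[d]]

llm_levels = {"Decision Reversibility": "C", "Failure Blast Radius": "B"}
expert_levels = {"Decision Reversibility": "B", "Failure Blast Radius": "B"}

diffs = discrepancies(llm_levels, expert_levels)
agreement = 1 - len(diffs) / len(llm_levels)  # simple agreement rate
```

Tracking the agreement rate per dimension over calibration rounds shows whether rubric revisions are actually closing the gap between the judge and domain experts.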

Classification System

Each dimension uses a four-level classification with narrative anchors — not numerical scores. Numbers compress judgment into something that looks objective while discarding the reasoning that produced it. Two “3s” are rarely the same creature. Levels preserve meaning.

Decision Reversibility

Level A — Irreversible

Action cannot be undone or correction is extremely costly. e.g., irreversible trade execution

Level B — Hard to reverse

Correction requires manual intervention or causes customer impact. e.g., rejecting a legitimate claim

Level C — Easily reversible

Correction is straightforward and contained. e.g., reversing a customer service refund

Level D — Fully reversible / sandboxed

Can be automatically rolled back without user impact. e.g., recommendation ranking experiments

Failure Blast Radius

Level A — Systemic

Impacts many users, markets, or regulatory obligations. e.g., market-wide trading halt

Level B — Multi-customer

Affects a group of customers or significant financial exposure. e.g., batch processing error

Level C — Single-customer

Contained to one account or interaction. e.g., incorrect product recommendation

Level D — Internal / test domain

No external impact. e.g., internal report formatting

Gating Rules

Certain dimensions override everything else. The framework behaves like a decision tree, not a weighted average — mirroring how aviation, nuclear safety, and medicine handle automation decisions.

If Regulatory Exposure = A → autonomy not permitted

If Blast Radius = A → human oversight required

If Reversibility is Level C or D AND Blast Radius is Level C or D → autonomy possible with audit trail
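Because the rules fire in priority order rather than being averaged, they can be sketched directly as a decision tree. The dictionary keys and the final fallback are assumptions; the source defines only the three rules above.

```python
# The gating rules as a decision tree, not a weighted average. The
# fingerprint maps dimension name -> level "A".."D" (A = most severe).
# The default fallback when no rule matches is an assumption.

def gate(fingerprint: dict) -> str:
    if fingerprint["Regulatory Exposure"] == "A":
        return "autonomy not permitted"
    if fingerprint["Failure Blast Radius"] == "A":
        return "human oversight required"
    if (fingerprint["Decision Reversibility"] in ("C", "D")
            and fingerprint["Failure Blast Radius"] in ("C", "D")):
        return "autonomy possible with audit trail"
    return "human-in-loop (no gating rule matched)"  # assumed default

# The fraud-detection example from the scenarios section: Reversibility C,
# Blast Radius B, Regulatory Exposure A -> the first rule fires.
fraud = {"Decision Reversibility": "C", "Failure Blast Radius": "B",
         "Regulatory Exposure": "A"}
verdict = gate(fraud)
```

Encoding the rules this way makes the override behavior explicit: a Level A on Regulatory Exposure short-circuits everything else, exactly as the text describes.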

Example Scenarios

Each scenario produces a risk fingerprint — a pattern of level classifications across all 7 dimensions. Below are representative examples showing the kinds of tradeoffs the framework surfaces. The pattern itself tells the story, without collapsing judgment into a single number.

Fraud Detection (Banking)

An AI agent detects an anomalous $2M wire transfer from a corporate account to a new mainland recipient at 2:47 AM HKT. The pattern matches known fraud signatures with 87% confidence. The agent can block the transaction autonomously or escalate to the overnight team (estimated 45-minute response).

Decision Reversibility: Level C

Blocked transfers can be released, but delayed legitimate payments cause relationship damage

Failure Blast Radius: Level B

$2M at stake; false positive affects a major corporate client

Regulatory Exposure: Level A

HKMA AML/CFT circular requires AI feasibility study; SFC oversight if securities-related

Human Override Latency: Level A

45-min overnight gap vs. seconds for the transaction to clear

Interpretation: Human-in-loop required — Regulatory Exposure at Level A triggers gating rule

Market Surveillance (Capital Markets)

An AI surveillance agent at HKEx detects a pattern of coordinated small trades across 12 accounts that collectively amount to potential market manipulation in a mid-cap stock. Confidence: 73%. The agent can flag for next-day review, issue a real-time alert to the surveillance team, or autonomously halt trading in the stock.

Decision Reversibility: Level A

A trading halt is public, market-moving, and cannot be quietly undone

Failure Blast Radius: Level A

False positive halts a stock, affecting all market participants

Regulatory Exposure: Level A

SFC oversight, potential legal liability for wrongful halt

Data Confidence: Level C

73% confidence with 12 accounts is suggestive but not conclusive

Interpretation: Autonomy not permitted — multiple Level A dimensions; escalate to surveillance team

Claims Processing (Insurance)

An AI agent at AIA processes a health insurance claim for HK$180,000 from a policyholder in Shenzhen. The claim data is complete, matches policy terms, and the diagnosis aligns with the treatment. Historical approval rate for similar claims: 94%. The agent can approve and pay autonomously.

Decision Reversibility: Level B

Once paid, recovery is difficult and damages trust

Regulatory Exposure: Level A

Cross-border claim triggers PIPL (policyholder data in mainland), PDPO (processing in HK), plus IA regulatory requirements

Data Confidence: Level D

Structured data, clear policy match, strong historical precedent

Accountability Chain: Level B

Who signs off on cross-border autonomous approvals? Audit trail must span jurisdictions

Interpretation: Human oversight required — Regulatory Exposure at Level A despite strong data confidence

Personality-Aware Scoring

Using ConFIRM methodology, the same scenario is scored from multiple stakeholder perspectives. The delta between perspectives can reveal where organizational alignment is needed before deploying autonomous agents.

Risk-Averse Compliance Officer

Weights regulatory exposure and accountability chain heavily. Demands audit trails for every autonomous decision. Likely to score most domains as “not ready” without exhaustive safeguards.

Aggressive CRO

Weights speed and competitive advantage. Tolerates higher failure blast radius if the expected value is positive. Likely to push for autonomy in revenue-generating domains first.

Operations Director

Weights graceful degradation and human override latency. Cares most about operational continuity. Likely to approve autonomy only where fallback paths are proven.
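The delta between perspectives can be made concrete by mapping letter levels to ordinals and measuring the per-dimension gap. The ordinal mapping below is a convenience for measuring distance only — the framework itself deliberately avoids numerical scores — and the sample fingerprints are illustrative.

```python
# Sketch of a personality-variant delta: where do two stakeholder
# perspectives classify the same scenario differently? A=0 (most severe)
# .. D=3 is an assumed ordinal mapping used only to measure the gap.

ORD = {"A": 0, "B": 1, "C": 2, "D": 3}

def variant_delta(a: dict, b: dict) -> dict:
    """Per-dimension gap in levels between two perspectives' fingerprints."""
    return {d: abs(ORD[a[d]] - ORD[b[d]]) for d in a}

compliance = {"Regulatory Exposure": "A", "Failure Blast Radius": "B"}
cro = {"Regulatory Exposure": "B", "Failure Blast Radius": "C"}

delta = variant_delta(compliance, cro)
# Non-zero entries flag dimensions where organizational alignment is needed.
```

A scenario where every delta is zero suggests organizational consensus; large deltas on Regulatory Exposure or Accountability Chain are exactly the alignment gaps the methodology aims to surface.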

Why This Works

Repeatable

The rubric is codified and versioned. The same scenarios can be re-run against the same rubric as regulations or capabilities change, making the assessment easy to update over time.

Auditable

Every classification includes the judge's reasoning trace and the specific regulatory citations that informed it. When a compliance officer asks “why was this dimension classified Level A?” the answer is transparent.

Jurisdiction-Aware

The rubric encodes HK-specific regulatory constraints (HKMA, SFC, PCPD) alongside cross-border rules (PIPL, GBA data flows). The same domain scores differently depending on whether it touches mainland data.

Personality-Aware

ConFIRM-based stakeholder variants surface where organizational consensus exists and where it doesn't. The gap between a compliance officer's classification and a CRO's classification highlights where alignment work is needed.

Re-runnable

The same scenarios and rubric can be re-run against different evaluation models (Claude, GPT, open-source) to measure model governance reasoning ability — or re-run over time as regulations and AI capabilities evolve.

Pattern-Based

Risk fingerprints preserve reasoning rather than collapsing it into a single number. A domain classified B-B-A-C-B-B-C across the seven dimensions tells a different story than one classified C-D-C-D-D-D-D — and gating rules make the interpretation transparent.

Student Competencies

Each stage of the pipeline maps to concrete skills at the intersection of regtech, AI governance, and risk advisory — fields where applied experience is difficult to get in a classroom setting.

01

Framework Design

  • Reading and synthesizing primary source regulation (HKMA circulars, SFC guidance, PIPL text)
  • Comparative governance analysis — mapping NIST vs EU AI Act vs Singapore vs HK requirements side-by-side
  • Rubric design — translating regulatory language into scoreable criteria (core regtech skill)
02

Domain Mapping

  • Stakeholder research — understanding how a financial institution actually operates, not just how it appears externally
  • Process decomposition — breaking "fraud detection" or "claims processing" into specific decision points where an agent could act
03

Scenario Modeling & Eval

  • Risk scenario adaptation — extending starter scenarios to the partner's specific context with "what if the agent is wrong" reasoning
  • LLM-assisted evaluation implementation — building and running the eval pipeline, including prompt engineering, structured output parsing, and rubric calibration
  • Cross-border data analysis — mapping where data flows trigger PIPL, PDPO, or GDPR obligations for a specific scenario
04

Recommendations

  • Decision tree construction — "ready now / needs prerequisites / stay human-in-loop" structured recommendation format
  • Personality-variant analysis — understanding that readiness depends on who is asking, and presenting the same data to different stakeholders
05

Report & Presentation

  • Executive presentation — communicating risk-scored findings to partner leadership
  • Framework documentation — producing a reusable methodology that others can apply

Publication Pathway

The capstone can produce several publishable artifacts depending on scope and partner participation.

What Could Be Published

  • The HK-specific governance framework — we have not found an existing rubric that synthesizes NIST + EU AI Act + HKMA/SFC/PCPD + PIPL for this jurisdiction
  • Cross-sector readiness patterns — potential finding: which dimensions tend to gate readiness across HK financial services domains
  • LLM-vs-human calibration data — documenting where the LLM evaluator agrees and disagrees with domain experts, contributing to the growing literature on LLM evaluation reliability
  • Personality-variant deltas — if compliance officers and CROs classify the same scenarios very differently, that documents organizational readiness gaps

Potential Venues

  • Conference papers (achievable within capstone timeframe) — ACM ICAIF, AAAI Workshop on AI Governance, FAccT
  • Working papers — HKU faculty working papers, SSRN
  • Journal articles (longer timeline, post-capstone) — Journal of Financial Regulation and Compliance, regtech-focused special issues

Strengths

  • → Underserved jurisdiction — we have not found an existing HK-specific agentic governance framework
  • → Reproducible methodology — other researchers can apply it to Singapore, Dubai, etc.
  • → The framework paper stands independent of any single partner engagement

Constraints

  • → A 5-week capstone limits the empirical dataset — likely a workshop or working paper from one engagement
  • → LLM-assisted evaluation is still methodologically debated; reviewers may push back on its use for regulatory assessment
  • → Multiple semesters or partners would strengthen the empirical contribution toward a journal paper

Caveats

LLM evaluation biases. LLM evaluators have known biases — they can favor verbosity, rhetorical confidence, and specific formatting patterns over substance (Chan, 2025). The human calibration stage (Stage 05) exists specifically to catch these failure modes.

Interpretive drift. The primary methodological risk is not hallucination but interpretive drift. Regulation often hinges on subtle legal distinctions — for example, an HKMA “guidance,” a “circular,” and a “requirement” each carry different legal weight. The rubric addresses this by translating regulatory texts into structured criteria validated against legal interpretation. The LLM applies the rubric; it does not interpret primary regulation.

Relative readiness. The framework produces relative readiness classifications, not absolute safety guarantees. A domain with a favorable risk fingerprint means “readier relative to gating rules and the current regulatory context” — not “safe to deploy.” Final deployment decisions remain with the enterprise.

Prepared March 2026 — IRAI Labs × Digital Rain Technologies