AI Risk Assessment Checklist (AIRAC)

A fast, comprehensive, and auditable way to spot AI risks, gather proof, and make go/no-go decisions across the lifecycle

Item-by-item checks, evidence thresholds, and sign-offs aligned with RAISEF — for both classic ML and GenAI.

Latest Version: v0.06 (2025) • Updated: 2025-11-11 • License: CC BY-NC 4.0 • Page Count: Checklist 23; Full 32

The AI Risk Assessment Checklist (AIRAC) is a practical worksheet you can run in design, testing, pre-release and post-release reviews. It walks teams through scoped sections — Scope & Governance → System & Lifecycle Mapping → Stakeholders & Harms → Baseline Risk Catalog → Criteria & Scales → Evaluation → Prioritization → Controls → Decision → Operations & Assurance — with "check-when" criteria, evidence-quality levels (L1–L4), and Responsible/Accountable sign-offs so acceptance is traceable. For GenAI, AIRAC adds focused items on prompts, RAG, hallucination/factuality, injection/jailbreak risk, watermarking/provenance, and safety layering. Use the Quick Start to work item-by-item, attach evidence, and record decisions; only tick a box when artifacts exist, are approved, and meet targets/tolerances.

Outputs & Deliverables

What your team walks away with after completing AIRAC.

Completed checklist: a signed-off record of controls evaluated and their status.
Risk register: entries with inherent and residual risk scores, owners, and mitigation plans.
Decision log: explicit outcomes for each material risk: Accept, Mitigate, Defer, or Stop, with rationale.
Evidence index: links to artifacts supporting the assessment (tests, benchmarks, design docs, sign-offs).
Monitoring & incident plan: defined KPIs/alerts, runbooks, escalation paths, and communications.
Re-assessment dates: scheduled checkpoints (e.g., quarterly or on major change) with owners and triggers.

What's in AIRAC

Artifacts & Evidence

Use-case brief: intent, boundaries, success criteria; link doc and record approver.
Risk classification rubric: regulatory/RAISEF class + business criticality, with sign-off.
RACI & owners: accountable owner(s), stakeholders, lifecycle coverage.
Risk appetite & gates: thresholds and phase gate criteria linked.
Oversight model & kill-switch: HOTL/HITL points, rollback tests.
Architecture diagram: data flows; training vs. inference; inputs/outputs.

Tests & Evaluations

Accuracy / robustness / OOD: baseline + shift tests, metrics ≥ target.
Fairness audits: cohort metrics with gaps ≤ tolerance or plan accepted.
Privacy leakage / re-ID: evaluations + DPIA; PETs noted.
Security testing: poisoning/evasion/model theft scenarios tied to threat model.
Abuse & content safety: coverage for top abuses meets target.
Rollback & kill-switch drills: evidence of recent successful tests.

Controls & Safeguards

Technical controls: data controls, least-privilege, crypto, logging/traceability, rate limits, sandboxes.
Process/organizational controls: secure SDLC, reviews, change management.
Human oversight controls: criteria, training, escalation, shadow/veto.
UX controls: disclosures, safe defaults, fallback/kill-switch, appeal/recourse.

Governance & Sign-offs

Responsible / Accountable signatures: named sign-offs for each item.
Evidence quality (L1–L4): level recorded; higher levels for high-risk.
Conditional acceptance: ticket with owner and expiry linked.
Risk appetite & gates: thresholds that trigger go/no-go.
Section-level approval: (e.g., 3.Z / 7.Z / 10.Z) formal checkpoint before proceeding

Operations & Monitoring

RAG freshness / drift / citation accuracy: keep retrieval current and reliable.
Canary updates for models/embeddings/prompts: gate rollout by health checks.
Abuse escalation & user reporting loops: measured hand-offs to resolution.
Watermark/provenance efficacy + takedown playbook: curb deepfakes and misuse.
Sustainability & cost/latency reporting: resource and impact visibility.
Ongoing red-team cadence (production): discover new jailbreaks/exfil paths.

GenAI-specific Additions

Elevate GenAI-critical risks: hallucination, injection, synthetic media flagged and gated.
Validate fine-tune/training data rights & consent: list sources and legal bases; close gaps or accept risk with ticket.
RAG & citation monitoring: track freshness, drift, and accuracy in live traffic.
Watermark/provenance and takedown: monitor effectiveness and be ready to act.
Canarying model/prompt changes: small exposure, measurable rollback criteria.

Scales, Gates & Decisions

Evidence quality (L1–L4): from light to rigorous; record the level used.
Accept / Conditional Accept / Reject: decision plus conditions, ticket, owner, expiry.
Risk class & tolerances: classification and thresholds align with rules.

Examples

Accept: "Privacy leakage eval L3 independent internal passed; evidence PRJ-42; R=Data Sci Lead, A=Product Owner; decision logged."
Conditional Accept: "Robustness stress on edge-case set pending; ticket #12345, owner=SecEng, expiry=2025-11-15; fallback=rate-limit + human review."
Reject: "Content-safety metrics below threshold in 2 cohorts; no compensating control; rerun after fixes and re-evaluation."

Quick Start

Short steps

Set context: pick lifecycle stage, risk class, and the feature/use-case ID.
Name owners: record R (Responsible) and A (Accountable); confirm risk appetite and phase gate criteria.
Work item-by-item: read the check-when statement, attach or link evidence, and tag the evidence level (L1–L4).
Run missing tests: schedule any required evaluations (fairness, robustness, privacy, security, abuse).
Review: R verifies completeness; A reviews fitness vs. thresholds/tolerances.
Decide: mark Accept, Conditional Accept, or Reject; record date and scope of the decision.
Publish the pack: export the decision log, evidence index, and risk register; notify co-signers.
Plan reassessment: set the next review date, monitoring triggers, and incident/rollback contacts.

Evidence Levels Micro-legend (L1-L4)

L1: Self-attest (owner statement / policy; no independent check yet)
L2: Peer-review (reviewed by a qualified peer team; documented feedback addressed)
L3: Independent internal (independent function inside the org, e.g., QA/Risk/Privacy/Security, with reproducible evidence)
L4: Independent external (third-party assessment or audit; production-like evidence and formal report)

High-risk items usually require L3+ before release. Record the level for each item.

High & Medium Risk Evidence Gating

High risk: Evidence quality must be L3+ before release.
Medium risk: Evidence quality must be L2+ before release.
Low risk: L1 may be acceptable, but L2+ is preferred.

Sign-offs Micro-legend (Responsible / Accountable)

Responsible (R): does the work and assembles evidence; attests “meets check-when.”
Accountable (A): owns the risk and accepts or rejects it; must be a named manager.
Co-signers (when applicable): Legal • Privacy • Security • Safety • Domain owner.

How to Record Conditional Acceptance

If an item is Conditional Accept, capture all four:

Owner: who must complete the condition.
Ticket: tracker ID/URL with the exact deliverable(s).
Expiry: date/time the condition must be met or re-reviewed.
Fallback: throttle/rollback or compensating control if the condition is missed.

Conditional items roll up to the release gate. Any high-risk item that is still conditional at expiry triggers escalation and re-review.

Evidence & Sign-off Rules

Evidence quality (L1–L4)

L1: Self-attest: Owner statement or policy. No independent check yet.
L2: Peer-review: Reviewed by a qualified peer team. Feedback addressed and documented.
L3: Independent internal: Checked by an independent function (e.g., QA, Risk, Privacy, Security). Reproducible evidence linked to data/code/configs.
L4: Independent external: Third-party assessment or audit with a formal report and production-like evidence.

Gating: High risk requires L3+ before release. Medium risk requires L2+. (Low risk: L1 may be acceptable, L2 preferred.)

Who signs (R/A)

Responsible (R): Does the work, assembles evidence, and attests that the check-when criteria are met.
Accountable (A): Owns the risk and decides Accept / Conditional Accept / Reject. Must be a named manager.

Best practice: one named person per role. Avoid “team” or shared mailboxes for R or A.

When co-sign is required

Legal: New or high-risk jurisdiction, T&Cs changes, IP/licensing exposures, or material user-rights impact.
Privacy: Personal data, sensitive categories, cross-border transfers, new purpose, DPIA/FRIA triggers.
Security: Internet-exposed features, new vendors, model/data exfiltration risks, supply-chain changes.
Safety: Potential for physical harm, high-stakes decisions, open-ended generation or autonomous actions.
Domain owner: When domain policy or compliance standards apply (e.g., clinical, financial, public sector).

How to record a Conditional Accept

Owner: Person responsible for closing the condition.
Ticket: Tracker ID/URL with exact deliverables.
Expiry: Date/time for completion or re-review.
Fallback: Temporary control (e.g., throttle, feature flag, rollback) if not met by expiry.

Examples

Accept: "Privacy leakage evaluation L3 independent internal passed across 3 scenarios; evidence: PRJ-42 report; R=Data Science Lead; A=Product Owner; decision date recorded."
Conditional Accept: "Robustness stress on edge-case set pending; ticket #12345, owner=SecEng, expiry=2025-11-15; fallback=rate limit at 1 rps and human review for flagged cases."

Risk Scales & Gates

Standardized scales make scoring comparable across teams and releases, and they turn risk into objective go/no-go criteria. Define the scales below with clear examples, then document the numeric thresholds that trigger high-risk status and the decisions each threshold enforces.

Core scales (all systems)

Likelihood (1–5): shared probability yardstick for how likely a risk scenario is to occur; include concrete examples for each level.
Impact (1–5): severity of harm across safety, privacy, financial, legal, and reputational dimensions; calibrate “material” impact to appetite and escalation rules.
Detectability (1–5, optional): how reliably/quickly the issue is noticed in operation; use it if it helps triage and residual-risk judgments, otherwise record why it’s not used.

Use these as starter labels your teams can tailor. Keep the 1–5 levels consistent across programs; swap the examples for your domain.

Likelihood (1–5): "How likely is this to occur in the next release cycle?"

1 Very unlikely: <1% chance; requires multiple failures; no prior occurrences in similar launches.
2 Unlikely: ~1–5%; rare edge condition; 1 prior instance out of many launches.
3 Possible: ~5–20%; plausible with one control failing; a few prior instances.
4 Likely: ~20–50%; expected if monitoring misses; several prior instances or near-misses.
5 Very likely: >50%; will occur without new controls; repeatedly observed in pilots or analogues.

Impact (1–5): "If it occurs, how bad is it?"

1 Minimal: Minor annoyance; no personal data; self-heals; no user or regulatory impact.
2 Low: Small cohort affected; reversible with standard support; no legal/contract exposure.
3 Moderate: Material inconvenience or minor financial loss; PII involved but contained; complaint risk.
4 High: Significant harm to users or business; safety/privacy/legal exposure; regulator contact likely.
5 Severe/Catastrophic: Physical harm or major rights impact; large financial/regulatory exposure; immediate rollback and notify.

Detectability (1–5, optional): "How reliably/quickly will we notice?"

1 Easy: Automated alerts fire before users notice; precise signals; <1 hour to detect.
2 Good: Alerting covers most scenarios; same day detection.
3 Limited: Signals noisy or lagging; detection within a few days.
4 Poor: Hard to spot without manual review; detection likely weeks later.
5 None: No realistic detection until after external reports/incident; post-hoc discovery only.

GenAI-specific scales

Groundedness / Confidence: e.g., grounded / partial / ungrounded for generated outputs; tie to factuality checks and user disclosures.
Exposure / Reach: # users affected or outputs produced; higher reach warrants stricter thresholds and faster remediation.
Reversibility / Velocity of harm: how fast harm occurs and how reversible it is; rapid/irreversible harms justify conservative releases and readily available kill-switches.
Human-oversight coverage: who reviews, when, and how; quantify coverage to validate oversight design and staffing.

High-risk thresholds & decision rules

Document the numeric rules (by phase and use-case type) that flip an item to high-risk and link gates to those rules. Example patterns include: Likelihood ≥4 and Impact ≥3; or composite score ≥ a defined cutoff. These rules must be referenced by release gates and escalation paths.

Evidence gating: High-risk items require L3+ evidence before release; Medium-risk items require L2+. Record the evidence level for each decision. (See Evidence & Sign-offs.)

How gates work

Design/Testing gates: if a threshold is met or exceeded, add controls or strengthen evidence (e.g., move from L2 to L3) before proceeding.
Pre-release gates: "high-risk" status triggers mandatory escalation and stronger oversight; the gate decision is Accept / Conditional Accept / Reject with rationale.
Post-release gates: monitoring/incident thresholds (e.g., drift, safety, abuse) force reassessment or rollback per the oversight and kill-switch plan.

Assumptions & uncertainty

List major assumptions and known unknowns; tie each to a follow-up action. Making uncertainty explicit reduces overconfidence and directs extra evaluation where risk concentrates.

Feature Comparison

Tool / Checklist	Scope	GenAI coverage	Evidence model	Governance	Operations	License / Access	Recommended when…	Official link
AIRAC (RAISEF)	Full lifecycle (Sections 1–10): scope → mapping → harms → risk catalog → criteria → evaluation → prioritization → controls → decision → ops/assurance.	Explicit items: prompts, RAG freshness/citations, hallucination/factuality, injection/jailbreak, watermark/provenance, safety layering.	L1–L4 (Self-attest → Peer-review → Independent internal → Independent external) with explicit "check-when" criteria.	Named R/A sign-offs; section approvals; conditional acceptance needs owner, ticket, expiry, fallback.	Live monitoring metrics, incident playbook, re-assessment cadence; canarying/rollback patterns.	CC BY-NC 4.0 (free non-commercial use with attribution).	You need an auditable, item-by-item review with clear gates for classic ML or GenAI.	RAISEF AIRAC
NIST AI RMF Playbook	Framework functions (Govern, Map, Measure, Manage); broad guidance and activities rather than a single checklist.	Generative AI Profile available; risk actions tailored to GenAI use cases.	No fixed L1–L4 scheme; emphasizes documentation, measurement, and organizational processes.	Govern function stresses roles, policies, and accountability structures.	Manage function covers monitoring, incident response, and continuous improvement.	Free public guidance.	You want a reference framework to inform policies and risk programs across the org.	NIST AI RMF
Canada TBS Algorithmic Impact Assessment (AIA)	Questionnaire assessing impact level for automated decision systems used by federal institutions.	Not GenAI-specific; applicable to a range of ADS use cases.	Self-assessment questionnaire; produces required mitigation measures by impact level.	Mandated governance controls and public transparency for higher impact levels.	Requires impact-appropriate measures (e.g., human review, monitoring, notice).	Free; required for Government of Canada departments.	You need a formal impact level and mandated controls in the public sector.	TBS AIA
UK ICO AI & Data Protection Risk Toolkit	Spreadsheet/toolkit to identify and treat data-protection risks in AI systems.	Not GenAI-specific; applies data-protection principles to any AI.	Evidence captured via risk/action logs; no L1–L4 scheme.	Focus on controller/processor responsibilities and DPIA integration.	Practical actions for monitoring and review under UK GDPR.	Free; maintained by the regulator.	You are prioritizing data-protection compliance and DPIA alignment.	ICO Toolkit
Singapore IMDA AI Verify	Testing framework and tool for validating claims (e.g., fairness, robustness, transparency) with a conformance report.	Actively adds GenAI tests and guidance; scenario-based evaluations.	Evidence is test-artifacts and reports; no L1–L4; emphasizes measurable test results.	Governance captured through documentation and tester declarations.	Operational test suites; integrates with technical evaluations.	Free to try; programmatic engagement with IMDA.	You want a test-oriented approach that produces a technical conformance report.	AI Verify
ISO/IEC 42001 (AI Management System)	Organization-level management system standard for AI; policies, processes, and continual improvement.	Technology-agnostic; applicable to GenAI via risk/controls integration.	Audit-based conformity assessment; no itemized L1–L4 model.	Strong emphasis on roles, accountability, and documented procedures.	Requires operational monitoring and improvement cycles.	Paid standard; third-party certification available.	You need certifiable management-system assurance for customers/partners.	ISO/IEC 42001
EU AI Act (context)	Regulatory obligations by risk class; conformity assessment, technical documentation, post-market monitoring.	Includes provisions for general-purpose/GenAI models and systemic risks.	Evidence defined by legal requirements; notified-body assessment for certain systems.	Mandated governance, risk, and transparency measures.	Post-market monitoring & incident reporting required.	Law; official journal publication.	You sell or deploy in the EU and need legal compliance mapping.	EU AI Act

Notes: AIRAC is a publicly available checklist with an evidence-quality model and release gates. Other entries are regulator/standards frameworks or public-sector tools. Where they specify controls but not an L1–L4 scheme, we describe how evidence is typically produced (e.g., test artifacts, audits). Links go to official publishers.

GenAI Specifics

Callouts for GenAI deployments and the corresponding controls to apply with AIRAC.

Prompt governance: version prompts, review/approve changes, protect secrets.
Why this matters: Prompts steer behavior; unmanaged edits can leak secrets or regress safety.
Sampling configuration: set temperature/top-p/seed, max tokens, and stop sequences per use case.
Why this matters: Tuning controls variance and determinism; mis-set values cause instability and risk.
RAG freshness & TTL: define index recency, cache TTLs, and invalidation on source updates.
Why this matters: Stale context produces wrong answers; TTLs enforce timely refreshes.
Injection & jailbreak testing: red-team prompts, canaries, and end-to-end evals pre-release.
Why this matters: Adversarial prompts can bypass guardrails; testing reduces abuse paths.
Toxicity & PII benchmarks: measure harmful content and leakage with representative data.
Why this matters: Quantifies safety posture and verifies mitigations before scale.
Watermark & provenance: mark AI-generated content and record chain-of-custody.
Why this matters: Traceability enables disclosure, detection, and downstream trust.
Tool permissioning: least-privilege scopes, explicit approvals, and auditable logs.
Why this matters: Model-invoked tools can cause real-world changes; restrict scope and audit.
High-stakes human gating: mandatory human review for safety, legal, or financial outcomes.
Why this matters: Human oversight reduces catastrophic error in high-impact decisions.

Downloads & Versions

Version	Published	Changes
v0.06 (2025)	2025-11-11	Agentic and multi-agent systems introduce compounding operational risks under autonomy, long horizons, and high tempo. This update strengthens autonomy scaling, human-oversight capacity, and production controls, emphasizing closed-loop testing and measurable gates: Agent orchestration maps, autonomy bounds, termination criteria, and downgrade/rollback paths Human-oversight failure modes, thresholds for throttle/reroute, and safe reviewer ratios Emergent-behavior/recursive-drift probes; long-horizon MAS simulations and non-termination detection AL0–AL4 autonomy/horizon scale with clear gating and downgrade triggers Oversight workload/alert-fatigue scale; realistic HOTL/HITL drills to validate effectiveness Hard budgets on steps/horizon/spend; cascading kill-switches and safe-mode downgrades Safety invariants & tripwires with pre/post-action checks and replayable traces Production monitoring of oversight health SLOs, agent loops, and drift with auto-throttle/downgrade on breach Added eleven items (2.G7, 3.04, 4.G7, 5.G5, 5.G6, 6.08, 6.GA, 8.07, 8.08, 10.07, and 10.G9) to explicitly assess operational risks under these conditions, improving the comprehensive coverage provided by the AIRAC.
v0.05 (2025)	2025-11-09	Smaller or less experienced organizations (e.g., SMBs or early adopters of AI) often face unique risk exposures due to: Limited internal expertise in model governance, data privacy, and safe deployment Incomplete or informal processes for risk classification, testing, and monitoring. Weak institutional memory, meaning lessons from past technology rollouts aren’t embedded. Dependence on vendors or third parties without sufficient in-house validation capacity. Insufficient oversight or audit practices, leading to unverified AI outcomes or compliance gaps. These factors can result in higher residual and systemic risk even when checkboxes appear "complete." Added four items (1.07, 4.08, 8.06, and 10.06) to explicitly assess organizational maturity in AI adoption. These are a meaningful source of additional risk included in the AIRAC.
v0.04 (2025)	2025-10-27	Pre-release version for comments

Version

Published

Changes

Download

v0.06 (2025)

2025-11-11

Agentic and multi-agent systems introduce compounding operational risks under autonomy, long horizons, and high tempo. This update strengthens autonomy scaling, human-oversight capacity, and production controls, emphasizing closed-loop testing and measurable gates:

Agent orchestration maps, autonomy bounds, termination criteria, and downgrade/rollback paths
Human-oversight failure modes, thresholds for throttle/reroute, and safe reviewer ratios
Emergent-behavior/recursive-drift probes; long-horizon MAS simulations and non-termination detection
AL0–AL4 autonomy/horizon scale with clear gating and downgrade triggers
Oversight workload/alert-fatigue scale; realistic HOTL/HITL drills to validate effectiveness
Hard budgets on steps/horizon/spend; cascading kill-switches and safe-mode downgrades
Safety invariants & tripwires with pre/post-action checks and replayable traces
Production monitoring of oversight health SLOs, agent loops, and drift with auto-throttle/downgrade on breach

Added eleven items (2.G7, 3.04, 4.G7, 5.G5, 5.G6, 6.08, 6.GA, 8.07, 8.08, 10.07, and 10.G9) to explicitly assess operational risks under these conditions, improving the comprehensive coverage provided by the AIRAC.

v0.05 (2025)

2025-11-09

Smaller or less experienced organizations (e.g., SMBs or early adopters of AI) often face unique risk exposures due to:

Limited internal expertise in model governance, data privacy, and safe deployment
Incomplete or informal processes for risk classification, testing, and monitoring.
Weak institutional memory, meaning lessons from past technology rollouts aren’t embedded.
Dependence on vendors or third parties without sufficient in-house validation capacity.
Insufficient oversight or audit practices, leading to unverified AI outcomes or compliance gaps.

These factors can result in higher residual and systemic risk even when checkboxes appear "complete."

Added four items (1.07, 4.08, 8.06, and 10.06) to explicitly assess organizational maturity in AI adoption. These are a meaningful source of additional risk included in the AIRAC.

v0.04 (2025)

2025-10-27

Pre-release version for comments

Responsible Use & Attribution

AIRAC is licensed under Creative Commons Attribution–NonCommercial 4.0 International (CC BY-NC 4.0) . You may share and adapt the checklist for non-commercial purposes with proper attribution.

What you may do

Share: copy and redistribute in any medium or format.
Adapt: remix, transform, and build upon the material.

What you must do

Attribution (TASL): Include Title, Author, Source (link), and License (link); indicate if changes were made.
NonCommercial: Use must not be primarily intended for commercial advantage or monetary compensation.
No endorsement: Don’t imply the licensor endorses you or your use.
No additional restrictions: Don’t add legal/technical terms that restrict others from doing what the license permits.

How to attribute (TASL example)

AI Risk Assessment Checklist (AIRAC v0.04, 2025) by Richard R. Khan — RAISEF, https://raisef.ai/tools/airac, licensed CC BY-NC 4.0; changes: [describe].

Logos & third-party material

"RAISEF" names and logos are not covered by the CC license. Any third-party content keeps its own rights; mark and respect those licenses.

Commercial use

For uses outside “NonCommercial” (e.g., resale, paid training bundles, inclusion in commercial products), request a separate license: licensing@raisef.ai.

Good practice

Link to the license and note changes made.
For web/PDF, embed license metadata (e.g., HTML rel="license" or PDF/XMP) so attribution is human- and machine-readable.