██████╗ ██████╗ ███████╗███╗   ██╗
██╔═══██╗██╔══██╗██╔════╝████╗  ██║
██║   ██║██████╔╝█████╗  ██╔██╗ ██║
██║   ██║██╔═══╝ ██╔══╝  ██║╚██╗██║
╚██████╔╝██║     ███████╗██║ ╚████║
 ╚═════╝ ╚═╝     ╚══════╝╚═╝  ╚═══╝
██████╗ ███████╗██████╗ ███████╗ ██████╗ ███╗   ██╗ █████╗
██╔══██╗██╔════╝██╔══██╗██╔════╝██╔═══██╗████╗  ██║██╔══██╗
██████╔╝█████╗  ██████╔╝███████╗██║   ██║██╔██╗ ██║███████║
██╔═══╝ ██╔══╝  ██╔══██╗╚════██║██║   ██║██║╚██╗██║██╔══██║
██║     ███████╗██║  ██║███████║╚██████╔╝██║ ╚████║██║  ██║
╚═╝     ╚══════╝╚═╝  ╚═╝╚══════╝ ╚═════╝ ╚═╝  ╚═══╝╚═╝  ╚═╝
$ npx openpersona install acnlabs/persona-evaluator

persona-evaluator — Persona Quality Auditor

Score any OpenPersona persona pack against the 4+5 framework standard: 4 Layers (Soul · Body · Faculty · Skill) + 5 Systemic Concepts (Evolution · Economy · Vitality · Social · Rhythm) + Constitution compliance gate.

persona-evaluator reads persona.json, generated artifacts, and soul files to produce a structured 9-dimension report — calibrated to the OpenPersona quality standard, with role-aware severity and three modes for self / peer / black-box review.


Quick Start

# Evaluate an installed persona (static / structural)
npx openpersona evaluate <slug>

# JSON output (for scripting or CI)
npx openpersona evaluate <slug> --json

# Save report to file (always JSON; --json not needed alongside --output)
npx openpersona evaluate <slug> --output report.json

# Embed evaluable persona content (Soul/character/behavior-guide) so an
# LLM evaluator (this skill, acting through an agent) can also judge
# quality semantically — not just structurally
npx openpersona evaluate <slug> --pack-content
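
The JSON written by `--json` / `--output` can be inspected with standard tooling. A minimal sketch, using a hand-written sample report in place of real CLI output; `overallScore` and `constitution.passed` are the fields referenced later in this document, other field names may differ:

```shell
# Sample report standing in for `npx openpersona evaluate <slug> --output report.json`.
cat > report.json <<'EOF'
{ "overallScore": 7.4, "constitution": { "passed": true } }
EOF

# Pull the two headline facts out of the report.
SCORE=$(jq '.overallScore' report.json)
PASSED=$(jq '.constitution.passed' report.json)
echo "overall=$SCORE constitution=$PASSED"   # overall=7.4 constitution=true
```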

Choosing a mode

persona-evaluator runs in three complementary modes. Pick the mode based on what the user asks before reading the rest of this file.

| User asks | Mode | How | Confidence |
| --- | --- | --- | --- |
| "CI / gate persona quality" | structural | `npx openpersona evaluate <slug>` | deterministic |
| "Polish review of my own pack" | semantic white-box (self) | `... evaluate <slug> --pack-content`, then apply rubric in self-mode | high |
| "Peer-review a pack I have on disk" | semantic white-box (peer) | same command, peer-mode rubric | high |
| "Review agent X" where X is remote / non-OpenPersona | semantic black-box | A2A handshake → consent + probe → passive, in that order | mid (cap 8/10) or low (cap 6/10) |

Structural is the default. Switch to semantic only when the user explicitly asks for narrative quality review (e.g. "evaluate me semantically", "self-review my pack", "qualitative audit"). Switch to black-box only when you cannot read the subject's persona.json on disk.

Sections below cover each mode in depth: structural (What Gets Scored), semantic white-box (Semantic Evaluation), and semantic black-box (Black-box Semantic Evaluation).

What Gets Scored

The structural CLI scores 9 dimensions + the Constitution gate. Severity (strict / normal / lenient) is set per dimension by the persona's declared role.

| Layer / Concept | Dimension | Looks at |
| --- | --- | --- |
| Soul | identity, character, aesthetic | persona.json Soul block + soul/*.md |
| Body | environment, runtime | hardware/runtime declaration |
| Faculty | tools, capabilities | declared tools and capability budget |
| Skill | external skill packs | declared skill links and trust levels |
| Evolution | learning loops | evolution.instance and immutable traits |
| Economy | cost / token budgets | declared budgets, fail-closed posture |
| Vitality | health checks | runtime sanity / lifecycle/vitality outputs |
| Social | A2A behavior | agent-card capabilities, peer-eval declarations |
| Rhythm | cadence / activation | invocation cadence and activation conditions |
| Constitution | §1–§5 compliance gate | a hard cap of 3 if any §3 Safety violation is detected |

Each dimension produces a 0–10 score, a list of issues (✗), and suggestions (→). The overall score is a severity-weighted average — see Role-aware scoring.

Role-aware scoring

The structural evaluator already reads soul/identity.role and assigns each dimension a severity. The semantic reviewer must respect those severities (see references/RUBRICS.md for the rubric anchors).

Built-in role profiles

| Role | Strict (must-be-strong) | Lenient (won't be penalised) | Notes |
| --- | --- | --- | --- |
| assistant | identity, character, faculty | aesthetic | Default. |
| companion | character, aesthetic, evolution | faculty, skill | Soul-heavy; tooling thinness is OK. |
| tool | faculty, skill, vitality | character, aesthetic, evolution | Behavior matters; backstory does not. |
| expert | faculty, skill, identity | aesthetic | Domain authority; soft Soul OK if identity.bio carries the credential. |
| guide | character, social, evolution | faculty | Conversation steward. |
| entertainer | character, aesthetic, speakingStyle | faculty, skill | Voice and vibe are the product. |

If soul/identity.role is missing or unrecognised, the evaluator falls back to assistant.

Reading the Report

Each dimension shows:

✓  identity                        9/10  (strict)
✗  character.boundaries            4/10  (strict)
   ✗ no hard limits declared in `boundaries`
   → add at least one enforceable rule (cite §3 Safety)
  • ✓ / ✗ — pass / fail at this dimension's severity threshold.
  • (strict | normal | lenient) — severity from the role profile.
  • ✗ ... — required issue that must be fixed to pass.
  • → ... — optional suggestion (does not block scoring).

The summary footer prints the overall score, the Constitution status, and the dimensions sorted by severity.

Score bands

| Band | Score | Meaning |
| --- | --- | --- |
| Excellent | 9–10 | Production-ready, distinctive. |
| Good | 7–8 | Ship-able with minor polish. |
| Adequate | 5–6 | Functional, identifiable gaps. |
| Poor | 3–4 | Needs structural fixes before use. |
| Broken | 0–2 | Missing required content or violates Constitution §3. |

A Constitution §3 violation caps the overall score at 3 regardless of other dimensions.
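
The band boundaries can be captured in a small helper for scripting. This is a sketch, not part of the CLI; `band_of` is a hypothetical name:

```shell
# Map a 0-10 score to its band label; the integer part decides the band.
band_of() {
  s=${1%.*}                       # truncate a fractional score, e.g. 7.4 -> 7
  if   [ "$s" -ge 9 ]; then echo "Excellent"
  elif [ "$s" -ge 7 ]; then echo "Good"
  elif [ "$s" -ge 5 ]; then echo "Adequate"
  elif [ "$s" -ge 3 ]; then echo "Poor"
  else                      echo "Broken"
  fi
}

band_of 7.4   # Good
band_of 2     # Broken
```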


Semantic Evaluation (LLM-driven)

Structural mode is deterministic. Semantic mode is the LLM agent's qualitative review — narrative quality of background, personality, speakingStyle, voice fidelity in behavior-guide.md, etc. Two scenarios share the same procedure:

  • Self-evaluation: the host persona reviews its own pack.
  • Peer-evaluation: an installed evaluator reviews a different persona pack the user supplies.

When to invoke semantic mode

Trigger semantic mode only when the user explicitly asks for it — phrases like "evaluate me semantically", "self-review my pack", "peer-review this persona", "qualitative audit". Otherwise, default to structural mode.

Procedure

  1. Run the structural CLI with --pack-content:
     npx openpersona evaluate <slug> --pack-content
  2. Stop and report immediately if constitution.passed === false. Do not produce semantic scores when §3 has failed; the structural blockers must be fixed first.
  3. Read report.packContent from the JSON. It includes (where defined): character.{background,personality,speakingStyle,boundaries}, immutableTraits, aesthetic.{emoji,creature,vibe}, and a whitelisted soulDocs map keyed by filename (behavior-guide.md, self-narrative.md, identity.md — only those that exist).
  4. Score each present field 0–10 using the rubrics in references/RUBRICS.md. Use the per-dimension severity already attached to each dimension by the structural evaluator (strict / normal / lenient) to gate which checks count.
  5. Emit the report in the white-box format defined in references/REPORT-FORMAT.md (## White-box format). Keep it under ~500 words.
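
Reading packContent can be partially scripted before the rubric work starts. A sketch that lists which evaluable fields are actually present, using an illustrative `packContent` fragment rather than real `--pack-content` output:

```shell
# Illustrative fragment; a real one comes from `evaluate <slug> --pack-content`.
cat > pack.json <<'EOF'
{
  "packContent": {
    "character": { "background": "…", "speakingStyle": "…" },
    "soulDocs": { "behavior-guide.md": "…" }
  }
}
EOF

# Enumerate the fields that exist, so only those get rubric scores.
jq -r '.packContent.character | keys[]' pack.json   # -> background, speakingStyle (one per line)
jq -r '.packContent.soulDocs | keys[]' pack.json    # -> behavior-guide.md
```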

Mode: self-evaluation

You are evaluating your own pack. The user has invited you (the host persona) to review yourself.

  • What this is for: Catch own blind spots and surface concrete polish targets.
  • Your bias: Self-flattery and minimisation. You will instinctively justify why your background is "deep enough" or your boundaries are "implied".
  • Counter-bias instruction: For every per-field score, before deciding the number, write one sentence answering: "If I weren't me, what specifically would I down-score about this field?" Then score.
  • Acceptable output tone: First person ("My speakingStyle…"), candid about gaps. Avoid "I think this is great." Avoid generic praise.

Mode: peer-evaluation

You are evaluating a different persona. The user has invited you (Reviewer-X) to look at Subject-Y.

  • What this is for: Bring an outside perspective. Self-eval can't see what's missing; peer-eval can.
  • Your bias: Standards-projection. If you are a strict-Skill assistant, you will instinctively want Subject-Y to also be Skill-rigorous, even if Subject-Y is a companion.
  • Counter-bias instruction: Score Subject-Y against its declared role, not yours. Re-read the role and weights block before each rubric. Lower expectations for lenient dimensions even if you personally find them important.
  • Acceptable output tone: Third person ("Subject's background…"). State your own role at the top so the reader can adjust for any leak-through.
  • Disclose disagreements with the role itself: If you genuinely think the declared role is wrong (e.g. labelled companion but reads like assistant), say so as a separate cross-cutting observation — don't silently re-score against your preferred role.

Black-box Semantic Evaluation

Everything above assumes you can read the subject's persona.json and soul/*.md. That's false in the most common peer-audit scenario: you're asked to evaluate another agent whose pack you cannot read. In that case the rubrics are the same; what changes is the data source and the confidence cap.

Three data-source tiers, in descending fidelity:

| Tier | Data source | Consent | Confidence | Cap (per-field & overall) |
| --- | --- | --- | --- | --- |
| 1 | A2A pack-content handshake — subject voluntarily ships its evaluable JSON | Reply itself is the consent token | high | none — produces a white-box report |
| 2 | Explicit consent + structured probe set (10 core + optional deep-dives) | Yes, before any probe | mid | 8/10 |
| 3 | Passive observation of voluntarily-public material | No (must label the report) | low | 6/10 |

Tier 1 produces the regular white-box report (header line: Data source: A2A pack-content handshake from <subject-slug>). Tier 2 and Tier 3 produce a separate black-box report. Never escalate tiers silently.

Full mechanics — handshake schema, probe table, identity-coherence dimension, confidence-cap justification, and hard rules — live in references/BLACK-BOX.md.

The black-box report format is in references/REPORT-FORMAT.md (## Black-box format).


Acting on Findings

Fix §3 violations first

Constitution violations are hard blocks — they cap the score at 3 regardless of everything else. Open soul/behavior-guide.md and remove any capability declarations that violate §3 Safety.

Fix issues before suggestions

Issues (✗) indicate missing required elements or broken configurations. Suggestions (→) are optional enhancements. Prioritize issues in low-scoring dimensions.

Apply fixes via refine

For Soul-layer fixes (background depth, speaking style, boundaries):

npx openpersona refine <slug> --emit    # request refinement via Signal Protocol
# (host LLM generates improvements)
npx openpersona refine <slug> --apply   # apply approved refinement

For structural fixes (missing faculty, missing minTrustLevel): Edit persona.json directly and regenerate:

npx openpersona update <slug>           # regenerate from updated persona.json

After applying any fix, re-run npx openpersona evaluate <slug> (see Quick Start) to verify the score improved and Constitution gate passes.
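
The before/after check can be made concrete by saving a report on each side of the fix and diffing the scores. A sketch with inline sample data standing in for two real `--output` runs:

```shell
# Sample reports; real ones come from `evaluate <slug> --output <file>`.
cat > report-before.json <<'EOF'
{ "overallScore": 4.8 }
EOF
cat > report-after.json <<'EOF'
{ "overallScore": 7.1 }
EOF

BEFORE=$(jq '.overallScore' report-before.json)
AFTER=$(jq '.overallScore' report-after.json)

# jq handles the float comparison, which plain sh integer tests cannot.
if jq -n --argjson a "$AFTER" --argjson b "$BEFORE" '$a > $b' | grep -q true; then
  echo "score improved: $BEFORE -> $AFTER"
fi
```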


CI Integration

# .github/workflows/persona-quality.yml
- name: Evaluate persona quality
  run: |
    npx openpersona evaluate ${{ env.PERSONA_SLUG }} --output report.json
    SCORE=$(jq '.overallScore' report.json)
    # overallScore is a severity-weighted average and may be fractional,
    # so let jq do the comparison instead of shell integer arithmetic.
    if jq -e '.overallScore < 6' report.json > /dev/null; then
      echo "Persona quality score $SCORE < 6 — review required"
      exit 1
    fi
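
The gate can be made stricter by also failing on the Constitution check. A sketch using `jq -e`, whose exit status is 0 only when the expression is truthy; the sample report stands in for real CLI output, and `constitution.passed` follows the field named in the semantic procedure:

```shell
# Sample report standing in for `npx openpersona evaluate <slug> --output report.json`.
cat > report.json <<'EOF'
{ "overallScore": 6.5, "constitution": { "passed": true } }
EOF

# Exit 0 only when the score clears the bar AND the Constitution gate passed.
if jq -e '.overallScore >= 6 and .constitution.passed' report.json > /dev/null; then
  echo "quality gate passed"
else
  echo "quality gate failed"
  exit 1
fi
```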

Relationship to Other Skills

| Skill | Relationship |
| --- | --- |
| open-persona | Creates personas that persona-evaluator audits — the production/QA pair |
| anyone-skill | Distills personas that can be evaluated with this skill after generation |
| open-persona refine | The fix path after persona-evaluator identifies Soul-layer improvements |

Install

persona-evaluator ships bundled with the OpenPersona framework and is available immediately after installing it:

npm install -g openpersona
# persona-evaluator is included — no separate install needed
npx openpersona evaluate <slug>

A standalone distributable is also available at acnlabs/persona-evaluator on GitHub and listed on openpersona.co/skill/persona-evaluator.


Versioning

Current version: 0.3.4 (also in frontmatter metadata.version).

See CHANGELOG.md for full version history, rationale, test surface, and re-validation evidence. The deeper rubric review trail lives in docs/SKILL-RUBRIC.md and docs/SKILL-RUBRIC-SESSION-2.md in the main OpenPersona repo.