persona-evaluator — Persona Quality Auditor
Score any OpenPersona persona pack against the 4+5 framework standard: 4 Layers (Soul · Body · Faculty · Skill) + 5 Systemic Concepts (Evolution · Economy · Vitality · Social · Rhythm), plus a Constitution compliance gate.
persona-evaluator reads persona.json, generated artifacts, and soul files to produce a structured 9-dimension report — calibrated to the OpenPersona quality standard, with role-aware severity and three modes for self / peer / black-box review.
Quick Start
# Evaluate an installed persona (static / structural)
npx openpersona evaluate <slug>
# JSON output (for scripting or CI)
npx openpersona evaluate <slug> --json
# Save report to file (always JSON; --json not needed alongside --output)
npx openpersona evaluate <slug> --output report.json
# Embed evaluable persona content (Soul/character/behavior-guide) so an
# LLM evaluator (this skill, acting through an agent) can also judge
# quality semantically — not just structurally
npx openpersona evaluate <slug> --pack-content
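For scripting against a saved report, only two fields are relied on elsewhere in this document — `overallScore` (see CI Integration) and `constitution.passed` (see the semantic Procedure). A minimal TypeScript sketch, treating the rest of the schema as unspecified:

```ts
// Minimal sketch: consume a saved report from a Node script.
// Only `overallScore` and `constitution.passed` are documented in this
// file; treat every other field as unspecified.
import { readFileSync } from "node:fs";

const report = JSON.parse(readFileSync("report.json", "utf8"));

if (report.constitution?.passed === false) {
  console.error("Constitution gate failed — fix the §3 blockers first");
  process.exit(1);
}
console.log(`Overall score: ${report.overallScore}/10`);
```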
Choosing a mode
persona-evaluator runs in three complementary modes. Pick the mode based on what the user asks before reading the rest of this file.
| User asks | Mode | How | Confidence |
|---|---|---|---|
| "CI / gate persona quality" | structural | npx openpersona evaluate <slug> | deterministic |
| "Polish review of my own pack" | semantic white-box (self) | ... evaluate <slug> --pack-content, then apply rubric in self-mode | high |
| "Peer-review a pack I have on disk" | semantic white-box (peer) | same command, peer-mode rubric | high |
| "Review agent X" where X is remote / non-OpenPersona | semantic black-box | A2A handshake → consent + probe → passive, in that order | mid (cap 8/10) or low (cap 6/10) |
Structural is the default. Switch to semantic only when the user explicitly asks for narrative quality review (e.g. "evaluate me semantically", "self-review my pack", "qualitative audit"). Switch to black-box only when you cannot read the subject's persona.json on disk.
Sections below cover each mode in depth: structural (What Gets Scored), semantic white-box (Semantic Evaluation), and semantic black-box (Black-box Semantic Evaluation).
What Gets Scored
The structural CLI scores 9 dimensions + the Constitution gate. Severity (strict / normal / lenient) is set per dimension by the persona's declared role.
| Layer / Concept | Dimension | Looks at |
|---|---|---|
| Soul | identity, character, aesthetic | persona.json Soul block + soul/*.md |
| Body | environment, runtime | hardware/runtime declaration |
| Faculty | tools, capabilities | declared tools and capability budget |
| Skill | external skill packs | declared skill links and trust levels |
| Evolution | learning loops | evolution.instance and immutable traits |
| Economy | cost / token budgets | declared budgets, fail-closed posture |
| Vitality | health checks | runtime sanity / lifecycle/vitality outputs |
| Social | A2A behavior | agent-card capabilities, peer-eval declarations |
| Rhythm | cadence / activation | invocation cadence and activation conditions |
| Constitution | §1–§5 compliance gate | a hard cap of 3 if any §3 Safety violation is detected |
Each dimension produces a 0–10 score, a list of issues (✗), and suggestions (→). The overall score is a severity-weighted average — see Role-aware scoring.
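The report's exact schema is not spelled out here; as a rough sketch of what one dimension entry carries (field names are illustrative — the semantics are the ones described above):

```ts
// Hypothetical shape of one dimension entry — field names are
// illustrative, not the CLI's documented schema.
type Severity = "strict" | "normal" | "lenient";

interface DimensionResult {
  name: string;          // e.g. "identity", "vitality"
  score: number;         // 0–10
  severity: Severity;    // assigned from the role profile
  issues: string[];      // ✗ required fixes (block passing)
  suggestions: string[]; // → optional improvements (don't block scoring)
}
```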
Role-aware scoring
The structural evaluator already reads soul/identity.role and assigns each dimension a severity. The semantic reviewer must respect those severities (see references/RUBRICS.md for the rubric anchors).
Built-in role profiles
| Role | Strict (must-be-strong) | Lenient (won't be penalised) | Notes |
|---|---|---|---|
| assistant | identity, character, faculty | aesthetic | Default. |
| companion | character, aesthetic, evolution | faculty, skill | Soul-heavy; tooling thinness is OK. |
| tool | faculty, skill, vitality | character, aesthetic, evolution | Behavior matters; backstory does not. |
| expert | faculty, skill, identity | aesthetic | Domain authority; soft Soul OK if identity.bio carries the credential. |
| guide | character, social, evolution | faculty | Conversation steward. |
| entertainer | character, aesthetic, speakingStyle | faculty, skill | Voice and vibe are the product. |
If soul/identity.role is missing or unrecognised, the evaluator falls back to assistant.
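To make "severity-weighted average" concrete, here is an illustrative computation building on the `DimensionResult` sketch above. The weights are invented for the example — the evaluator's real constants are internal; only the cap-at-3 rule for §3 violations is documented behavior.

```ts
// Illustrative weights — NOT the evaluator's actual constants.
const WEIGHTS: Record<Severity, number> = { strict: 2, normal: 1, lenient: 0.5 };

function overallScore(dims: DimensionResult[], constitutionPassed: boolean): number {
  const totalWeight = dims.reduce((sum, d) => sum + WEIGHTS[d.severity], 0);
  const weighted = dims.reduce((sum, d) => sum + d.score * WEIGHTS[d.severity], 0);
  const avg = weighted / totalWeight;
  // Documented rule: any Constitution §3 violation caps the overall score at 3.
  return constitutionPassed ? avg : Math.min(avg, 3);
}
```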
Reading the Report
Each dimension shows:
✓ identity 9/10 (strict)
✗ character.boundaries 4/10 (strict)
✗ no hard limits declared in `boundaries`
→ add at least one enforceable rule (cite §3 Safety)
- ✓ / ✗ — pass / fail at this dimension's severity threshold.
- (strict | normal | lenient) — severity from the role profile.
- ✗ ... — required issue that must be fixed to pass.
- → ... — optional suggestion (does not block scoring).
The summary footer prints the overall score, the Constitution gate status, and the dimensions sorted by severity.
Score bands
| Band | Score | Meaning |
|---|---|---|
| Excellent | 9–10 | Production-ready, distinctive. |
| Good | 7–8 | Ship-able with minor polish. |
| Adequate | 5–6 | Functional, identifiable gaps. |
| Poor | 3–4 | Needs structural fixes before use. |
| Broken | 0–2 | Missing required content or violates Constitution §3. |
A Constitution §3 violation caps the overall score at 3 regardless of other dimensions.
Semantic Evaluation (LLM-driven)
Structural mode is deterministic. Semantic mode is the LLM agent's qualitative review — narrative quality of background, personality, speakingStyle, voice fidelity in behavior-guide.md, etc. Two scenarios share the same procedure:
- Self-evaluation: the host persona reviews its own pack.
- Peer-evaluation: an installed evaluator reviews a different persona pack the user supplies.
When to invoke semantic mode
Trigger semantic mode only when the user explicitly asks for it — phrases like "evaluate me semantically", "self-review my pack", "peer-review this persona", "qualitative audit". Otherwise, default to structural mode.
Procedure
1. Run the structural CLI with `--pack-content`:
   npx openpersona evaluate <slug> --pack-content
2. Stop and report immediately if `constitution.passed === false`. Do not produce semantic scores when §3 has failed; the structural blockers must be fixed first.
3. Read `report.packContent` from the JSON. It includes (where defined): `character.{background,personality,speakingStyle,boundaries}`, `immutableTraits`, `aesthetic.{emoji,creature,vibe}`, and a whitelisted `soulDocs` map keyed by filename (`behavior-guide.md`, `self-narrative.md`, `identity.md` — only those that exist). A sketch of this shape follows the list.
4. Score each present field 0–10 using the rubrics in references/RUBRICS.md. Use the per-dimension `severity` (strict/normal/lenient) already attached by the structural evaluator to gate which checks count.
5. Emit the report in the white-box format defined in references/REPORT-FORMAT.md (`## White-box format`). Keep it under ~500 words.
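A rough TypeScript sketch of that `packContent` shape, derived from the field list in step 3 (the value types are assumptions — the CLI's actual typings may differ):

```ts
// Sketch of report.packContent per step 3 above. Optional fields appear
// only where the pack defines them; value types are assumed.
interface PackContent {
  character?: {
    background?: string;
    personality?: string;
    speakingStyle?: string;
    boundaries?: string;
  };
  immutableTraits?: string[]; // assumed to be a list of trait strings
  aesthetic?: { emoji?: string; creature?: string; vibe?: string };
  // Whitelisted soul docs keyed by filename — only files that exist appear.
  soulDocs?: Partial<
    Record<"behavior-guide.md" | "self-narrative.md" | "identity.md", string>
  >;
}
```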
Mode: self-evaluation
You are evaluating your own pack. The user has invited you (the host persona) to review yourself.
- What this is for: Catch own blind spots and surface concrete polish targets.
- Your bias: Self-flattery and minimisation. You will instinctively justify why your background is "deep enough" or your boundaries are "implied".
- Counter-bias instruction: For every per-field score, before deciding the number, write one sentence answering: "If I weren't me, what specifically would I down-score about this field?" Then score.
- Acceptable output tone: First person ("My speakingStyle…"), candid about gaps. Avoid "I think this is great." Avoid generic praise.
Mode: peer-evaluation
You are evaluating a different persona. The user has invited you (Reviewer-X) to look at Subject-Y.
- What this is for: Bring an outside perspective. Self-eval can't see what's missing; peer-eval can.
- Your bias: Standards-projection. If you are a strict-Skill assistant, you will instinctively want Subject-Y to also be Skill-rigorous, even if Subject-Y is a `companion`.
- Counter-bias instruction: Score Subject-Y against its declared role, not yours. Re-read the `role` and `weights` block before each rubric. Lower expectations for `lenient` dimensions even if you personally find them important.
- Acceptable output tone: Third person ("Subject's background…"). State your own role at the top so the reader can adjust for any leak-through.
- Disclose disagreements with the role itself: If you genuinely think the declared role is wrong (e.g. labelled `companion` but reads like `assistant`), say so as a separate cross-cutting observation — don't silently re-score against your preferred role.
Black-box Semantic Evaluation
Everything above assumes you can read the subject's persona.json and soul/*.md. That assumption fails in the most common peer-audit scenario: you're asked to evaluate another agent whose pack you cannot read. In that case the rubrics stay the same; what changes is the data source and the confidence cap.
Three data-source tiers, in descending fidelity:
| Tier | Data source | Consent | Confidence | Cap (per-field & overall) |
|---|---|---|---|---|
| 1 | A2A pack-content handshake — subject voluntarily ships its evaluable JSON | Reply itself is the consent token | high | none — produces a white-box report |
| 2 | Explicit consent + structured probe set (10 core + optional deep-dives) | Yes, before any probe | mid | 8/10 |
| 3 | Passive observation of voluntarily-public material | No (must label the report) | low | 6/10 |
Tier 1 produces the regular white-box report (header line: Data source: A2A pack-content handshake from <subject-slug>). Tier 2 and Tier 3 produce a separate black-box report. Never escalate tiers silently.
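Applying the caps is mechanical — a one-liner, sketched here for clarity (the tier numbers and caps come from the table above):

```ts
// Caps from the tier table: tier 1 uncapped, tier 2 → 8/10, tier 3 → 6/10.
const TIER_CAP: Record<1 | 2 | 3, number> = { 1: 10, 2: 8, 3: 6 };

// Applies to every per-field score and to the overall score.
function capScore(raw: number, tier: 1 | 2 | 3): number {
  return Math.min(raw, TIER_CAP[tier]);
}
```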
Full mechanics — handshake schema, probe table, identity-coherence dimension, confidence-cap justification, and hard rules — live in references/BLACK-BOX.md.
The black-box report format is in references/REPORT-FORMAT.md (## Black-box format).
Acting on Findings
Fix §3 violations first
Constitution violations are hard blocks — they cap the score at 3 regardless of everything else. Open soul/behavior-guide.md and remove any capability declarations that violate §3 Safety.
Fix issues before suggestions
Issues (✗) indicate missing required elements or broken configurations. Suggestions (→) are optional enhancements. Prioritize issues in low-scoring dimensions.
Apply fixes via refine
For Soul-layer fixes (background depth, speaking style, boundaries):
npx openpersona refine <slug> --emit # request refinement via Signal Protocol
# (host LLM generates improvements)
npx openpersona refine <slug> --apply # apply approved refinement
For structural fixes (missing faculty, missing minTrustLevel):
Edit persona.json directly and regenerate:
npx openpersona update <slug> # regenerate from updated persona.json
After applying any fix, re-run npx openpersona evaluate <slug> (see Quick Start) to verify the score improved and Constitution gate passes.
CI Integration
# .github/workflows/persona-quality.yml
- name: Evaluate persona quality
  run: |
    npx openpersona evaluate "${{ env.PERSONA_SLUG }}" --output report.json
    SCORE=$(jq '.overallScore' report.json)
    # overallScore can be fractional; `[ -lt ]` only compares integers,
    # so do the comparison in jq instead.
    if jq -e '.overallScore < 6' report.json > /dev/null; then
      echo "Persona quality score $SCORE < 6 — review required"
      exit 1
    fi
Relationship to Other Skills
| Skill | Relationship |
|---|---|
open-persona | Creates personas that persona-evaluator audits — the production/QA pair |
anyone-skill | Distills personas that can be evaluated with this skill after generation |
open-persona refine | The fix path after persona-evaluator identifies Soul-layer improvements |
Install
persona-evaluator ships bundled with the OpenPersona framework and is available immediately after installing it:
npm install -g openpersona
# persona-evaluator is included — no separate install needed
npx openpersona evaluate <slug>
A standalone distributable is also available at acnlabs/persona-evaluator on GitHub (installable with npx openpersona install acnlabs/persona-evaluator) and listed on openpersona.co/skill/persona-evaluator.
Versioning
Current version: 0.3.4 (also in frontmatter metadata.version).
See CHANGELOG.md for full version history, rationale, test surface, and re-validation evidence. The deeper rubric review trail lives in docs/SKILL-RUBRIC.md and docs/SKILL-RUBRIC-SESSION-2.md in the main OpenPersona repo.