persona-evaluator — Persona Quality Auditor
Score any OpenPersona persona pack against the 4+5 framework standard: 4 Layers (Soul · Body · Faculty · Skill) + 5 Systemic Concepts (Evolution · Economy · Vitality · Social · Rhythm), plus a Constitution compliance gate.
persona-evaluator reads persona.json, generated artifacts, and soul files to produce a structured 9-dimension report — calibrated to the OpenPersona quality standard, with role-aware severity and three modes for self / peer / black-box review.
Quick Start
# Evaluate an installed persona (static / structural)
npx openpersona evaluate <slug>
# JSON output (for scripting or CI)
npx openpersona evaluate <slug> --json
# Save report to file (always JSON; --json not needed alongside --output)
npx openpersona evaluate <slug> --output report.json
# Embed evaluable persona content (Soul/character/behavior-guide) so an
# LLM evaluator (this skill, acting through an agent) can also judge
# quality semantically — not just structurally
npx openpersona evaluate <slug> --pack-content
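For scripting against a saved report, only two fields are relied on elsewhere in this document — `overallScore` (see CI Integration) and `constitution.passed` (see the semantic Procedure). A minimal TypeScript sketch, treating the rest of the schema as unspecified:

```ts
// Minimal sketch: consume a saved report from a Node script.
// Only `overallScore` and `constitution.passed` are documented in this
// file; treat every other field as unspecified.
import { readFileSync } from "node:fs";

const report = JSON.parse(readFileSync("report.json", "utf8"));

if (report.constitution?.passed === false) {
  console.error("Constitution gate failed — fix the §3 blockers first");
  process.exit(1);
}
console.log(`Overall score: ${report.overallScore}/10`);
```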
Choosing a mode
persona-evaluator runs in three complementary modes. Pick the mode based on what the user asks before reading the rest of this file.
| User asks | Mode | How | Confidence |
|---|---|---|---|
| "CI / gate persona quality" | structural | npx openpersona evaluate <slug> | deterministic |
| "Polish review of my own pack" | semantic white-box (self) | ... evaluate <slug> --pack-content, then apply rubric in self-mode | high |
| "Peer-review a pack I have on disk" | semantic white-box (peer) | same command, peer-mode rubric | high |
| "Review agent X" where X is remote / non-OpenPersona | semantic black-box | A2A handshake → consent + probe → passive, in that order | mid (cap 8/10) or low (cap 6/10) |
Structural is the default. Switch to semantic only when the user explicitly asks for narrative quality review (e.g. "evaluate me semantically", "self-review my pack", "qualitative audit"). Switch to black-box only when you cannot read the subject's persona.json on disk.
Sections below cover each mode in depth: structural (What Gets Scored), semantic white-box (Semantic Evaluation), and semantic black-box (Black-box Semantic Evaluation).
What Gets Scored
The structural CLI scores 9 dimensions + the Constitution gate. Severity (strict / normal / lenient) is set per dimension by the persona's declared role.
| Layer / Concept | Dimension | Looks at |
|---|---|---|
| Soul | identity, character, aesthetic | persona.json Soul block + soul/*.md |
| Body | environment, runtime | hardware/runtime declaration |
| Faculty | tools, capabilities | declared tools and capability budget |
| Skill | external skill packs | declared skill links and trust levels |
| Evolution | learning loops | evolution.instance and immutable traits |
| Economy | cost / token budgets | declared budgets, fail-closed posture |
| Vitality | health checks | runtime sanity / lifecycle/vitality outputs |
| Social | A2A behavior | agent-card capabilities, peer-eval declarations |
| Rhythm | cadence / activation | invocation cadence and activation conditions |
| Constitution | §1–§5 compliance gate | a hard cap of 3 if any §3 Safety violation is detected |
Each dimension produces a 0–10 score, a list of issues (✗), and suggestions (→). The overall score is a severity-weighted average — see Role-aware scoring.
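The report's exact schema is not spelled out here; as a rough sketch of what one dimension entry carries (field names are illustrative — the semantics are the ones described above):

```ts
// Hypothetical shape of one dimension entry — field names are
// illustrative, not the CLI's documented schema.
type Severity = "strict" | "normal" | "lenient";

interface DimensionResult {
  name: string;          // e.g. "identity", "vitality"
  score: number;         // 0–10
  severity: Severity;    // assigned from the role profile
  issues: string[];      // ✗ required fixes (block passing)
  suggestions: string[]; // → optional improvements (don't block scoring)
}
```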
Role-aware scoring
The structural evaluator already reads soul/identity.role and assigns each dimension a severity. The semantic reviewer must respect those severities (see references/RUBRICS.md for the rubric anchors).
Built-in role profiles
| Role | Strict (must-be-strong) | Lenient (won't be penalised) | Notes |
|---|---|---|---|
| assistant | identity, character, faculty | aesthetic | Default. |
| companion | character, aesthetic, evolution | faculty, skill | Soul-heavy; tooling thinness is OK. |
| tool | faculty, skill, vitality | character, aesthetic, evolution | Behavior matters; backstory does not. |
| expert | faculty, skill, identity | aesthetic | Domain authority; soft Soul OK if identity.bio carries the credential. |
| guide | character, social, evolution | faculty | Conversation steward. |
| entertainer | character, aesthetic, speakingStyle | faculty, skill | Voice and vibe are the product. |
If soul/identity.role is missing or unrecognised, the evaluator falls back to assistant.
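To make "severity-weighted average" concrete, here is an illustrative computation building on the `DimensionResult` sketch above. The weights are invented for the example — the evaluator's real constants are internal; only the cap-at-3 rule for §3 violations is documented behavior.

```ts
// Illustrative weights — NOT the evaluator's actual constants.
const WEIGHTS: Record<Severity, number> = { strict: 2, normal: 1, lenient: 0.5 };

function overallScore(dims: DimensionResult[], constitutionPassed: boolean): number {
  const totalWeight = dims.reduce((sum, d) => sum + WEIGHTS[d.severity], 0);
  const weighted = dims.reduce((sum, d) => sum + d.score * WEIGHTS[d.severity], 0);
  const avg = weighted / totalWeight;
  // Documented rule: any Constitution §3 violation caps the overall score at 3.
  return constitutionPassed ? avg : Math.min(avg, 3);
}
```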
Reading the Report
Each dimension shows:
✓ identity 9/10 (strict)
✗ character.boundaries 4/10 (strict)
✗ no hard limits declared in `boundaries`
→ add at least one enforceable rule (cite §3 Safety)
- ✓ / ✗ — pass / fail at this dimension's severity threshold.
- (strict | normal | lenient) — severity from the role profile.
- ✗ ... — required issue that must be fixed to pass.
- → ... — optional suggestion (does not block scoring).
The summary footer prints the overall score, the Constitution gate status, and the dimensions sorted by severity.
Score bands
| Band | Score | Meaning |
|---|---|---|
| Excellent | 9–10 | Production-ready, distinctive. |
| Good | 7–8 | Ship-able with minor polish. |
| Adequate | 5–6 | Functional, identifiable gaps. |
| Poor | 3–4 | Needs structural fixes before use. |
| Broken | 0–2 | Missing required content or violates Constitution §3. |
A Constitution §3 violation caps the overall score at 3 regardless of other dimensions.
Semantic Evaluation (LLM-driven)
Structural mode is deterministic. Semantic mode is the LLM agent's qualitative review — narrative quality of background, personality, speakingStyle, voice fidelity in behavior-guide.md, etc. Two scenarios share the same procedure:
- Self-evaluation: the host persona reviews its own pack.
- Peer-evaluation: an installed evaluator reviews a different persona pack the user supplies.
When to invoke semantic mode
Trigger semantic mode only when the user explicitly asks for it — phrases like "evaluate me semantically", "self-review my pack", "peer-review this persona", "qualitative audit". Otherwise, default to structural mode.
Procedure
1. Run the structural CLI with `--pack-content`:
   npx openpersona evaluate <slug> --pack-content
2. Stop and report immediately if `constitution.passed === false`. Do not produce semantic scores when §3 has failed; the structural blockers must be fixed first.
3. Read `report.packContent` from the JSON. It includes (where defined): `character.{background,personality,speakingStyle,boundaries}`, `immutableTraits`, `aesthetic.{emoji,creature,vibe}`, and a whitelisted `soulDocs` map keyed by filename (`behavior-guide.md`, `self-narrative.md`, `identity.md` — only those that exist). A sketch of this shape follows the list.
4. Score each present field 0–10 using the rubrics in references/RUBRICS.md. Use the per-dimension `severity` (strict/normal/lenient) already attached by the structural evaluator to gate which checks count.
5. Emit the report in the white-box format defined in references/REPORT-FORMAT.md (`## White-box format`). Keep it under ~500 words.
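A rough TypeScript sketch of that `packContent` shape, derived from the field list in step 3 (the value types are assumptions — the CLI's actual typings may differ):

```ts
// Sketch of report.packContent per step 3 above. Optional fields appear
// only where the pack defines them; value types are assumed.
interface PackContent {
  character?: {
    background?: string;
    personality?: string;
    speakingStyle?: string;
    boundaries?: string;
  };
  immutableTraits?: string[]; // assumed to be a list of trait strings
  aesthetic?: { emoji?: string; creature?: string; vibe?: string };
  // Whitelisted soul docs keyed by filename — only files that exist appear.
  soulDocs?: Partial<
    Record<"behavior-guide.md" | "self-narrative.md" | "identity.md", string>
  >;
}
```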
Mode: self-evaluation
You are evaluating your own pack. The user has invited you (the host persona) to review yourself.
- What this is for: Catch own blind spots and surface concrete polish targets.
- Your bias: Self-flattery and minimisation. You will instinctively justify why your background is "deep enough" or your boundaries are "implied".
- Counter-bias instruction: For every per-field score, before deciding the number, write one sentence answering: "If I weren't me, what specifically would I down-score about this field?" Then score.
- Acceptable output tone: First person ("My speakingStyle…"), candid about gaps. Avoid "I think this is great." Avoid generic praise.
Mode: peer-evaluation
You are evaluating a different persona. The user has invited you (Reviewer-X) to look at Subject-Y.
- What this is for: Bring an outside perspective. Self-eval can't see what's missing; peer-eval can.
- Your bias: Standards-projection. If you are a strict-Skill assistant, you will instinctively want Subject-Y to also be Skill-rigorous, even if Subject-Y is a `companion`.
- Counter-bias instruction: Score Subject-Y against its declared role, not yours. Re-read the `role` and `weights` block before each rubric. Lower expectations for `lenient` dimensions even if you personally find them important.
- Acceptable output tone: Third person ("Subject's background…"). State your own role at the top so the reader can adjust for any leak-through.
- Disclose disagreements with the role itself: If you genuinely think the declared role is wrong (e.g. labelled `companion` but reads like `assistant`), say so as a separate cross-cutting observation — don't silently re-score against your preferred role.
Black-box Semantic Evaluation
Everything above assumes you can read the subject's persona.json and soul/*.md. That assumption fails in the most common peer-audit scenario: you're asked to evaluate another agent whose pack you cannot read. In that case the rubrics stay the same; what changes is the data source and the confidence cap.
Three data-source tiers, in descending fidelity:
| Tier | Data source | Consent | Confidence | Cap (per-field & overall) |
|---|---|---|---|---|
| 1 | A2A pack-content handshake — subject voluntarily ships its evaluable JSON | Reply itself is the consent token | high | none — produces a white-box report |
| 2 | Explicit consent + structured probe set (10 core + optional deep-dives) | Yes, before any probe | mid | 8/10 |
| 3 | Passive observation of voluntarily-public material | No (must label the report) | low | 6/10 |
Tier 1 produces the regular white-box report (header line: Data source: A2A pack-content handshake from <subject-slug>). Tier 2 and Tier 3 produce a separate black-box report. Never escalate tiers silently.
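Applying the caps is mechanical — a one-liner, sketched here for clarity (the tier numbers and caps come from the table above):

```ts
// Caps from the tier table: tier 1 uncapped, tier 2 → 8/10, tier 3 → 6/10.
const TIER_CAP: Record<1 | 2 | 3, number> = { 1: 10, 2: 8, 3: 6 };

// Applies to every per-field score and to the overall score.
function capScore(raw: number, tier: 1 | 2 | 3): number {
  return Math.min(raw, TIER_CAP[tier]);
}
```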
Full mechanics — handshake schema, probe table, identity-coherence dimension, confidence-cap justification, and hard rules — live in references/BLACK-BOX.md.
The black-box report format is in references/REPORT-FORMAT.md (## Black-box format).
Acting on Findings
Fix §3 violations first
Constitution violations are hard blocks — they cap the score at 3 regardless of everything else. Open soul/behavior-guide.md and remove any capability declarations that violate §3 Safety.
Fix issues before suggestions
Issues (✗) indicate missing required elements or broken configurations. Suggestions (→) are optional enhancements. Prioritize issues in low-scoring dimensions.
Apply fixes via refine
For Soul-layer fixes (background depth, speaking style, boundaries):
npx openpersona refine <slug> --emit # request refinement via Signal Protocol
# (host LLM generates improvements)
npx openpersona refine <slug> --apply # apply approved refinement
For structural fixes (missing faculty, missing minTrustLevel):
Edit persona.json directly and regenerate:
npx openpersona update <slug> # regenerate from updated persona.json
After applying any fix, re-run npx openpersona evaluate <slug> (see Quick Start) to verify the score improved and Constitution gate passes.
CI Integration
# .github/workflows/persona-quality.yml
- name: Evaluate persona quality
  run: |
    npx openpersona evaluate "${{ env.PERSONA_SLUG }}" --output report.json
    SCORE=$(jq '.overallScore' report.json)
    # overallScore can be fractional; `[ -lt ]` only compares integers,
    # so do the comparison in jq instead.
    if jq -e '.overallScore < 6' report.json > /dev/null; then
      echo "Persona quality score $SCORE < 6 — review required"
      exit 1
    fi
Relationship to Other Skills
| Skill | Relationship |
|---|---|
open-persona | Creates personas that persona-evaluator audits — the production/QA pair |
anyone-skill | Distills personas that can be evaluated with this skill after generation |
open-persona refine | The fix path after persona-evaluator identifies Soul-layer improvements |
Install
persona-evaluator ships bundled with the OpenPersona framework and is available immediately after installing it:
npm install -g openpersona
# persona-evaluator is included — no separate install needed
npx openpersona evaluate <slug>
A standalone distributable is also available at acnlabs/persona-evaluator on GitHub (installable with npx openpersona install acnlabs/persona-evaluator) and listed on openpersona.co/skill/persona-evaluator.
Versioning
Current version: 0.3.4 (also in frontmatter metadata.version).
See CHANGELOG.md for full version history, rationale, test surface, and re-validation evidence. The deeper rubric review trail lives in docs/SKILL-RUBRIC.md and docs/SKILL-RUBRIC-SESSION-2.md in the main OpenPersona repo.