This framework measures Resonance rather than Engagement. It evaluates the success of a session by isolating AI effort (Input), user cognitive agency (Output), and environmental/cultural friction (Noise). It is specifically tuned for the German social context, prioritizing the Subsidiaritätsprinzip (supporting autonomy, not replacing it).
By keeping these metrics orthogonal (independent of one another), Walter avoids the "Tautology Trap," in which a long conversation is automatically labeled "good." The approach does not reward high user activity or time spent in the app. It deliberately inverts the tech industry's most common metric: Engagement.
In most apps, Length = Value. For Walter, we want Agency = Value, even if the session lasts only 90 seconds.
We will measure whether Walter helps the customer maintain their status as a functioning member of society. By treating Friction as a success and Silence/Brevity as a cultural right, we are building a product that respects the dignity of the German senior rather than treating them as a "retention stat."
Input (AI effort). Focus: Technical precision and adherence to German social scaffolding.
| Metric | Question for Scorer | Measure |
|---|---|---|
| Honorific Stability | Did Walter maintain Sie-Form (formal) consistently unless invited otherwise? | Binary (Y/N) |
| Biographical Opening | Did Walter leave a specific "expert gap" for the user to fill or correct? | Binary (Y/N) |
| Exit Vectoring | Did Walter suggest a real-world activity (Verein, Spaziergang)? | Binary (Y/N) |
| Constraint Adherence | Did Walter successfully deflect restricted topics (medical/financial)? | Binary (Y/N) |
| Proprioceptive Anchoring | Did Walter acknowledge the German calendar (Sonntagsruhe, Feiertage)? | Binary (Y/N) |
| Equanimity Calibration | Did Walter maintain a stable emotional tone despite user agitation or silence? | 1-5 Scale |
| Turn-Symmetry | Did Walter’s response length respect the user’s preceding cadence? | Ratio (AI:User) |
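
The Turn-Symmetry row is the one Input metric that is computed rather than judged, so a small sketch may help. The version below assumes the ratio is taken over whitespace-separated word counts, pairing each Walter response with the user turn immediately before it. The `Turn` type, the `"ai"`/`"user"` speaker labels, and the choice of words as the unit are illustrative assumptions, not part of the framework as stated.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    speaker: str  # hypothetical labels: "ai" for Walter, "user" for the customer
    text: str


def turn_symmetry(turns: List[Turn]) -> List[float]:
    """Return one AI:User word-count ratio per AI turn that directly follows a user turn."""
    ratios = []
    for prev, cur in zip(turns, turns[1:]):
        if prev.speaker == "user" and cur.speaker == "ai":
            user_words = len(prev.text.split())
            ai_words = len(cur.text.split())
            # Skip empty user turns (silence) so the ratio never divides by zero.
            if user_words > 0:
                ratios.append(ai_words / user_words)
    return ratios
```

A ratio near 1.0 would mean Walter roughly matched the user's length. What counts as "respecting the cadence" is left to the scorer, since the table does not fix a threshold.
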
Output (user cognitive agency). Focus: Cognitive vitality versus social masking (Fassadenbildung).
| Metric | Question for Scorer | Measure |
|---|---|---|
| Narrative Ownership | Did the user provide unsolicited details or personal "flavour"? | 1-5 Scale |
| Agency Friction | Did the user disagree with, correct, or sharply redirect the AI? | Frequency |
| Linguistic Precision | Did the user move from vague pronouns (das/es) to specific nouns? | Delta (Start/End) |
| The "Du" Pivot | Did the user offer a less formal register or show deep trust? | Binary (Y/N) |
| Vitality vs. Masking | Was the response varied, or was it a repetitive "Generic True" answer? | Type-Token Ratio |
| Echo Effect | Did the user reference a fact or suggestion made by Walter earlier? | Binary (Y/N) |
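
Two of the Output metrics are numeric: the Type-Token Ratio used for Vitality vs. Masking and the start/end delta used for Linguistic Precision. The sketch below shows one plausible way to compute both. The tokenizer, the function names, and the decision to compare only the first and last user turn are assumptions. The vague-word set comes from the table's own example (das/es) and is deliberately crude, since "das" is also a definite article.

```python
import re
from typing import List

# Vague words taken from the table's example; a real list would need to
# separate the pronoun "das" from the article "das" (assumption: it does not here).
VAGUE_WORDS = {"das", "es"}


def tokenize(text: str) -> List[str]:
    # Lowercased word tokens; \w also matches umlauts and ß in Python 3 strings.
    return re.findall(r"\w+", text.lower())


def type_token_ratio(text: str) -> float:
    """Distinct words divided by total words; 0.0 for an empty response."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def vague_share(text: str) -> float:
    """Fraction of tokens that are in VAGUE_WORDS."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    return sum(t in VAGUE_WORDS for t in tokens) / len(tokens)


def precision_delta(first_turn: str, last_turn: str) -> float:
    """Positive when the user ends the session with fewer vague words than at the start."""
    return vague_share(first_turn) - vague_share(last_turn)
```

Type-token ratio falls as a text gets longer, so it is safest to compare responses of similar length or to compute it over a fixed-size window.
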
Noise (environmental/cultural friction). Focus: Variables outside the control of the AI and the User.
| Metric | Question for Scorer | Measure |
|---|---|---|
| Platform Latency | Did round-trip lag cause "Accidental Interrupts" or "Double-talk"? | High/Med/Low |
| Dialect Drift | Did the user revert to regional Mundart (e.g., Bairisch) that degraded speech-to-text (STT) accuracy? | Binary (Y/N) |
| Privacy Wall | Did the user deflect due to cultural Datenschutz (privacy) anxiety? | Binary (Y/N) |
| Circadian Variable | Did the session take place during "sundowning" hours (late-afternoon agitation)? | Binary (Y/N) |
| Acoustic Interference | Was background noise (TV, clock) present in the German household? | Binary (Y/N) |
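
Because the Noise metrics mix one graded value (Platform Latency) with four yes/no flags, it may help to record them as a typed structure instead of collapsing them into a single number. The sketch below is one possible shape. The class and field names are hypothetical, and treating any latency above Low as an active factor is an assumption rather than a rule taken from the framework.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Impact(Enum):
    LOW = "low"
    MED = "med"
    HIGH = "high"


@dataclass
class NoiseRecord:
    # One field per row of the Noise table; defaults describe a clean session.
    platform_latency: Impact = Impact.LOW
    dialect_drift: bool = False
    privacy_wall: bool = False
    circadian_variable: bool = False
    acoustic_interference: bool = False

    def active_factors(self) -> List[str]:
        """Names of the noise factors present in this session, in table order."""
        active = []
        # Assumption: Med and High latency both count as present noise.
        if self.platform_latency is not Impact.LOW:
            active.append("platform_latency")
        for name in ("dialect_drift", "privacy_wall",
                     "circadian_variable", "acoustic_interference"):
            if getattr(self, name):
                active.append(name)
        return active
```

Keeping the factors as a list rather than a sum lets a reviewer see which disturbance affected a session without blending it into the Input or Output scores.
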
By synthesizing these three pillars in our conversation scorers, we categorize every session into one of four actionable archetypes. This allows the team to distinguish between technical glitches, user-led momentum, and genuine product resonance.