RESEARCH APR 2026 · 8 min read

What cubicle measures, and the research it rests on.

A look at how we designed our AI skills assessment, what the underlying research says, and how it fits into existing hiring processes.

by Chris Ackerman

Co-founder · CTO

TL;DR (90-second read)

The problem we're working on. AI skills are difficult to define and even harder to evaluate in a standardized way. Hiring teams know they need to assess AI fluency, because day-to-day work increasingly depends on it, but the field doesn't yet have a shared rubric for what "good" looks like, and existing assessment formats weren't designed for it.
What we measure. Two pillars grounded in peer-reviewed research: Structured Prompting (framing, constraining, and refining instructions to an LLM) and Output Evaluation (detecting errors, hallucinations, and unsafe reasoning in AI-generated work). Sub-skills sit underneath each pillar, each tied to observable candidate actions and a 0–5 rubric.
How we measure it. Realistic job simulations in a knowledge-work context (desk research, output review, multi-turn refinement, repeatable deliverables), with AI tools available alongside source material. We score the process (what the candidate did and in what order), not just the results.
How it fits an existing hiring process. Roughly 30 minutes of candidate time. Designed to sit where a take-home case or screening assessment already sits, not to replace behavioral or final-round case interviews (for now, at least).
Customization. Works out of the box. Adapts to a pasted job description. Ingests real (anonymized) firm materials when a more tailored screen is wanted. Rubrics are constant across all three modes, so candidate scores remain comparable.
Where we are. Funded through MIT Sandbox; looking for 3–5 pilot partners.

→ Book a 20-minute pilot conversation.

The details

Why AI skills are hard to evaluate

Generative AI has changed the day-to-day work that early-career consultants do. Drafting, summarizing, restructuring data, producing competitor profiles at scale — these are increasingly AI-assisted in practice, including at firms whose hiring processes still treat AI as out of scope.

Hiring teams have noticed. The harder question is what to do about it. Three constraints make AI skills uniquely difficult to assess:

It's a doing-skill, not a knowing-skill. Knowing what good AI use looks like is different from doing it under pressure. Quizzes and knowledge tests measure the first and miss the second entirely.
Good output ≠ good process. Two candidates can produce nearly identical final artifacts from very different processes — one who set context cleanly on the first prompt, and one who iterated past several confused outputs to get there. They will perform differently on day three of an engagement, but the artifact alone won't tell you which is which.
The research is recent. The last two years have produced real evidence on what skilled AI users do, but it's scattered across different fields and there's no shared way to score it.

We're building cubicle to address these constraints. Consulting firms, our first customers, already feel the problem. They know AI fluency matters, and they're inventing their own workarounds: take-homes that allow AI, plus ad hoc questions about how candidates used it. What is missing is a standardized way to do this without losing the ability to tailor the assessment to their actual work.

Why consulting

Our primary customers are consulting firms for three reasons:

They already iterate their hiring process often and spend real time and budget customizing assessments — there is established willingness to invest in screening quality.
Their retention sits below the white-collar average — 80% at 18 months versus 88% across white-collar work, and MBB tenure averages 2.7 years against a 3.9-year industry mean (Onrec; BLS 2024; CaseCoach). A weak hire is more costly when the firm has fewer people to absorb the gap.
They compete on cutting-edge skills against larger firms but typically can't update assessment infrastructure as quickly. AI-era skill gaps in new hires are felt acutely.

Note: the skills we measure generalize beyond consulting.

What we measure

The assessment is organized around the following two pillars:

Structured Prompting is how well someone frames, constrains, and refines instructions to an AI to get a specific work outcome. Six sub-skills sit underneath, covering things like setting context, specifying outputs, breaking work into smaller steps, and refining across turns. Each is scored 0 to 5 based on what the candidate actually does.

Output Evaluation is how well someone catches errors, hallucinations, logical gaps, and unsafe reasoning in AI-generated work. Six more sub-skills cover fact-checking, spotting math and data errors, cross-referencing sources, and recovering when the AI drifts off-task.

We chose these pillars based on research across different industries.

Every pillar and sub-skill is grounded in peer-reviewed research on what high-performing AI users actually do. We chose the skills the research keeps surfacing, set scoring anchors based on that evidence, and kept only the ones we could observe reliably in 30 minutes. We want to score behaviors that have already been shown to predict good work, not ones we hope will.

Sources (11 papers & studies)

Dell'Acqua et al. (2023), Navigating the Jagged Technological Frontier. BCG / Harvard Business School.
Kosten et al., zero-shot vs. grounded prompting in clinical NLP. Frontiers in AI / JMIR.
Oliinyk et al., context and format specification effects on LLM output stability.
IBM Prompt Editing Practices study, a taxonomy of enterprise prompt edits.
Microsoft AI power-user survey (N ≈ 31,000).
Kupfer et al. (Frontiers in Psychology, N = 93), verification intensity and decision quality.
Bowman et al., "sandwiching" study on human–AI team verification behavior.
Leiser et al. (CHI 2024), HILL hallucination identification tool.
Tao, automation-bias recovery in nuclear operations.
Dojo Labs, client-safety of AI numerical claims in consulting.
GreaTerPrompt benchmark; RSIS, AI Literacy for Future Leaders.

How we test it

A candidate steps into the seat of a first-year analyst at a strategy firm. They have a clean simulated workspace with the brief, source materials, and AI tools available. The task formats include:

A desk research brief. Draft a partner-ready competitive landscape memo from a short brief and a couple of source documents. Measures context-setting, output specification, and (for longer tasks) breaking the work into pieces.
Output review with seeded errors. Review an AI-generated deliverable that contains a known set of factual and numerical errors — fabricated citations, wrong figures, reversed trends, misquoted statements — against the underlying source. Annotate each error with a source-specific citation. Surfaces verification, math/data error detection, and false-positive discipline.
Steering the AI back on track. A multi-turn task where the AI is set up to break a rule, contradict a source, or shift away from the right audience mid-conversation. Tests whether the candidate notices, pulls the AI back, and gets the work back on course.
Making consistent outputs from a single example. Given one finished example and several raw inputs, the candidate produces a batch of outputs that match the structure and depth of the original. Measures example-driven prompting and how well the candidate keeps the pattern consistent.
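As an illustration of how the seeded-error review format could be turned into numbers, here is a minimal Python sketch. It is not cubicle's scoring code. It assumes each seeded error has an ID and that every candidate annotation has already been matched either to one of those IDs or to nothing; the names score_review, seeded, and flagged are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewScore:
    recall: float         # share of seeded errors the candidate caught
    precision: float      # share of distinct findings that were real errors
    false_positives: int  # flags raised on content that was actually correct


def score_review(seeded: set[str], flagged: list[str | None]) -> ReviewScore:
    """Compare a candidate's annotations with the known seeded errors.

    Each entry in `flagged` is the ID of the seeded error the annotation
    matches, or None when the candidate flagged something that was correct.
    """
    caught = {f for f in flagged if f in seeded}
    false_positives = sum(1 for f in flagged if f not in seeded)
    findings = len(caught) + false_positives
    recall = len(caught) / len(seeded) if seeded else 0.0
    precision = len(caught) / findings if findings else 0.0
    return ReviewScore(recall, precision, false_positives)


# Hypothetical run: four seeded errors, three caught, one false alarm.
seeded_errors = {"fabricated_citation", "wrong_figure", "reversed_trend", "misquote"}
annotations = ["wrong_figure", "misquote", None, "reversed_trend"]
result = score_review(seeded_errors, annotations)
print(result)
```

Keeping false positives as a separate count is one way to represent false-positive discipline: a candidate who flags everything would reach full recall but would show low precision and a high false-positive count.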

Total candidate time across the suite is approximately 35–40 minutes, which is standard for a single consulting interview round.

Process-based scoring

The most consequential design choice is that we score what the candidate did, not only what they produced.

For each sub-skill, the rubric anchors are tied to observable candidate actions. To take a single example: for context-setting, a level-3 candidate names the client, audience, and at least one source document by name in their first prompt. A level-5 candidate does the same, and refines context in subsequent turns rather than repeating it from scratch. A level-1 candidate issues a bare instruction. The platform logs the prompt content; the rubric translates it into a score.
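The context-setting example can be written as a small rule, shown here as a hedged Python sketch rather than the platform's actual rubric engine. It covers only the three anchors described above (levels 1, 3, and 5) and returns None for behavior that falls between them, since those anchors are not spelled out here. The ContextSignals fields are hypothetical and assume some upstream step has already extracted them from the logged prompts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContextSignals:
    """Hypothetical signals extracted from a candidate's logged prompts."""
    names_client: bool            # first prompt names the client
    names_audience: bool          # first prompt names the audience
    names_source_document: bool   # first prompt names at least one source document
    refines_context_later: bool   # later turns refine context instead of restating it


def context_setting_anchor(s: ContextSignals) -> Optional[int]:
    """Map signals to the documented anchor level, or None if between anchors."""
    full_context = s.names_client and s.names_audience and s.names_source_document
    if full_context and s.refines_context_later:
        return 5
    if full_context:
        return 3
    if not (s.names_client or s.names_audience or s.names_source_document):
        return 1  # bare instruction with no context at all
    return None  # partial context: not covered by the three anchors above


# Hypothetical candidate: full context in the first prompt, no later refinement.
print(context_setting_anchor(ContextSignals(True, True, True, False)))
```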

This matters because two candidates can produce comparable final artifacts via very different processes, and the process is the part that predicts on-the-job behavior. Output-only scoring conflates them; process-based scoring separates them.

We are conservative about which sub-skills can be reliably tested in short windows. Task decomposition, for instance, requires enough turns for the structure of the candidate's chain to be observable; we score it as a secondary signal in shorter task formats and a primary signal only in longer ones. The testability profile of each sub-skill is documented in the employer-facing rubric.

Customization without losing comparability

Consulting partners typically want two things at once: an assessment that uses their actual work, and a benchmark that lets them compare candidates against each other and across firms. These are reconcilable.

The assessment runs out of the box with consulting-tuned defaults — a mid-market SaaS client, a lower-middle-market PE target, a regional financial services scenario. No setup required.

It adapts to a pasted job description. The platform parses the JD for vertical, seniority, target deliverable types, and audience, and generates scenario content matched to the role. This is the path most firms will use.

It ingests real firm materials for a more tailored screen — anonymized client briefs, competitor lists, sample partner-read memos, firm terminology glossaries, lists of "AI failure modes we keep seeing." The platform uses these to generate scenarios that look and feel like the firm's actual work.

The rubrics, observable-action definitions, and scoring mechanics are identical across all three modes. What changes is the surface — company names, industry, deliverable specifics, source-document content. What stays fixed is the standard. A candidate's score is therefore comparable across modes and across firms using the same template.
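One way to picture the split between a fixed standard and a variable surface is the following Python sketch. The class and field names are hypothetical and the structure is an assumption made for illustration, not a description of the platform's data model.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Rubric:
    """The fixed standard: identical in every customization mode."""
    version: str
    sub_skills: tuple[str, ...]
    min_level: int = 0
    max_level: int = 5


@dataclass(frozen=True)
class Scenario:
    """The variable surface, paired with the rubric it is scored against."""
    mode: str         # "default", "job_description", or "firm_materials"
    industry: str
    deliverable: str
    rubric: Rubric


def scores_comparable(a: Scenario, b: Scenario) -> bool:
    """Scores are comparable only when both scenarios share the same rubric."""
    return a.rubric == b.rubric


# Hypothetical scenarios: different surfaces, one shared rubric.
SHARED = Rubric("v1", ("context-setting", "output specification", "task decomposition"))
default = Scenario("default", "mid-market SaaS", "competitive landscape memo", SHARED)
tailored = Scenario("firm_materials", "regional financial services", "partner-read memo", SHARED)
print(scores_comparable(default, tailored))
```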

How it fits an existing hiring process

The assessment is designed to slot into the position currently occupied by a take-home case or a structured screening assessment — not to replace behavioral interviews, partner interviews, or final-round cases. Hiring teams retain full ownership of the funnel.

Per candidate, employers receive:

A radar chart of scored sub-skills with a weighted composite score.
A per-sub-skill breakdown with rationale tied to specific observable actions, plus guidance on what a stronger performance on each axis would look like.
A comparison view across candidates with per-axis percentile bands, filterable by role and customization set.
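To make the composite score and percentile view concrete, here is a minimal Python sketch. The weights, the sub-skill names used as keys, and the "at or below" percentile definition are all assumptions chosen for illustration; how the platform weights sub-skills or defines its percentile bands is not specified here.

```python
from __future__ import annotations


def weighted_composite(scores: dict[str, float], weights: dict[str, float]) -> float:
    """Weighted average of 0-5 sub-skill scores; stays on the same 0-5 scale.

    Raises KeyError if a scored sub-skill has no weight.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(scores[name] * weights[name] for name in scores) / total_weight


def percentile_rank(value: float, cohort: list[float]) -> float:
    """Percentage of cohort scores at or below `value`."""
    if not cohort:
        raise ValueError("cohort must not be empty")
    return 100.0 * sum(1 for x in cohort if x <= value) / len(cohort)


# Hypothetical candidate and cohort.
scores = {"context_setting": 4, "output_specification": 3, "verification": 5}
weights = {"context_setting": 2, "output_specification": 1, "verification": 1}
composite = weighted_composite(scores, weights)
print(composite, percentile_rank(composite, [2.0, 3.0, 4.0, 5.0]))
```

Computing the percentile per axis instead of on the composite would use the same percentile_rank function, applied to one sub-skill's scores across the cohort.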

The scorecard is built to be easy for hiring managers to read, defensible when teams calibrate together, and compatible with whatever applicant tracking system and process the firm already uses.

Where we are, and what we're looking for

cubicle is funded through MIT Sandbox and is in active conversations with two consulting firms that are reviewing our prototype. We are now looking for 3–5 pilot partners: consulting firms (50–300 employees) hiring a sizable summer or new-hire class that would value early access in exchange for structured feedback.

Pilots are paid, run for one role over three months, and include hands-on customization support. Partners receive the assessment platform, full rubric documentation, candidate scorecards, and the cross-candidate comparison view. We receive the case study and the iteration loop that will shape the next version of the product.

If your firm fits the profile, we would value the conversation.

→ Book a 20-minute pilot conversation.

cubicle is built by Prasiddhi Jain (CEO), Chris Ackerman (CTO/CPO), and Hamza Malik (Technical Product Lead) — three MIT-affiliated founders with backgrounds in technology consulting, edtech, GenAI, and AI engineering. Based in Cambridge, Massachusetts.
