# Supabase - AI Tooling Engineer - Essay Q1 (Eval system / instrumentation loop)

## Metadata

- **Company:** Supabase
- **Role:** AI Tooling Engineer
- **Application channel:** ashbyhq, via the Supabase careers page
- **Submission date:** 29 May 2026
- **CV used:** CV-A (oleg_proskurin_ai_engineer_fullstack_cv.md), PDF variant `oleg_proskurin_ai_engineer_fullstack_cv.pdf`
- **Status:** submitted, awaiting a response

## Question (verbatim)

> Tell us about an eval system, instrumentation loop, or quality framework you built for an AI product or workflow. What did you measure, and how did it change the product?

## Chosen angle

**Track A: Reliability loop.** Schema-validation-driven retry as the core, per-step instrumentation, and the extra cost/cache gains as a data-driven decision. A direct chain: measure → tune → ship-ready output.

## Final answer text (English, as submitted)

> On PrimeUI (see primeui.com, released Feb 2026) I've built the page generation flow. It takes a project brief and assembles a page from components from a registry where we have more than 200 real React components. The flow should generate content per component, that match the component schema and cohere with entire page content. The hard part is that one page has to satisfy several things at once: coherent content for the page's purpose, a valid component choice from the registry, props matching each component's schema, and a combination of components that holds up against strict design requirements.
>
> I created the per-component selection and content generation via an LLM (Gemini 2.5 Flash via Mastra), one request per step. And I had to build a harness around it. At each step the model picks from a pre-filtered pool of candidates ranked by a UX scoring system that judge how component combinations sit well together. Each step also gets the component registry, the already-generated content, and the page outline. The model returns structured JSON: component name, props object, and reasoning for the choice. The result is validated against the chosen component's schema (AJV with custom fields). On a validation hit the request repeats with an error message in feedback. After N retries it takes a fallback component. Components persist to DB only after they pass.
>
> Each step logs: cache-hit ratio, latency, retry count, and the UX score of the selected components. Running the generation many times with these logs gave me enough data to analyze. I built a Claude Code skill with instructions on how to compare and analyze the logs, used it to detect repeating patterns and weak spots in the generation, and was able to improve the process significantly.
>
> Before tuning, around 5 of 7 steps on a page needed at least one retry, now it's down to 1-2 of 7. Generation time went from 5-7 minutes to 2-3 minutes per page variant. Component schemas were tuned as well to generate semantically better copy. The overall result: generation quality went up, the flow got significantly faster, and token consumption dropped thanks to better cache hits.

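The validate, retry, fallback loop from the second paragraph of the answer can be sketched as follows. This is a minimal illustration, not the PrimeUI code: all type and function names are hypothetical, the `validate` callback stands in for an AJV-compiled schema validator, and `generate` stands in for the LLM request. The retry limit reuses the `MAX_GENERATION_RETRIES=5` value recorded in these notes.

```typescript
// Hypothetical shapes for illustration; none of these names come from the real codebase.
interface StepOutput {
  component: string;
  props: Record<string, unknown>;
  reasoning: string;
}

interface ValidationResult {
  valid: boolean;
  errors: string[];
}

interface StepResult {
  output: StepOutput;
  retries: number; // how many repeat requests were made after the first one
  usedFallback: boolean;
}

// Value taken from the notes (MAX_GENERATION_RETRIES=5).
const MAX_GENERATION_RETRIES = 5;

async function generateStep(
  generate: (feedback: string | null) => Promise<StepOutput>,
  validate: (output: StepOutput) => ValidationResult,
  fallback: StepOutput,
  maxRetries: number = MAX_GENERATION_RETRIES,
): Promise<StepResult> {
  let feedback: string | null = null;
  // Attempt 0 is the initial request; attempts 1..maxRetries are retries.
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const output = await generate(feedback);
    const result = validate(output);
    if (result.valid) {
      return { output, retries: attempt, usedFallback: false };
    }
    // Feed the validation errors back into the next request.
    feedback = `Previous output failed schema validation: ${result.errors.join("; ")}`;
  }
  // Every attempt failed validation: take the fallback component.
  return { output: fallback, retries: maxRetries, usedFallback: true };
}
```

Both `generate` and `validate` are injected so the sketch stays dependency-free; in a real flow the caller would persist `output` only after this function returns, which matches the "persist to DB only after they pass" rule in the answer.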
## Source material

- `cv-master-extended.md`: Block 4 (Eval & guardrails facts: AJV, retry triggers, MAX_GENERATION_RETRIES=5, per-step logging)
- The Russian harness text from the Lucky Hunter / XPN chats (latest cleaned-up version, finalized in April 2026)
- An early Q1 draft from this same Supabase chat (with a TODO about harness/AJV; it also held the figures compact notation 5x, reduced rendering 10-12x, retry 15-20→2-4, time 5-7→2-3 min)
- Oleg's clarifications for this submission: the metric "5 of 7 → 1-2 of 7" instead of "15-20 → 2-4", the reason for moving away from RAG ("analyzed, not effective for this scenario"), `up to 150K tokens`, UX scoring as "modeled on designers' and frontend folks' expert judgment"

## Key techniques (what makes the essay strong, for future reuse)

1. **Structure: "problem → solution → instrumentation → measurable shift → product consequence".** Each paragraph answers a separate part of the question.
2. **Connect measure → change.** Concrete before/after: 5/7 → 1-2/7 retries, 5-7min → 2-3min. Without this the answer falls apart into "here is our setup".
3. **Claude Code skill as the analysis loop.** Added by Oleg in the final iteration. This is the "how did it change the product" part: the metrics are *used* to improve the flow, which is stronger than "we logged things".
4. **Voice signatures preserved:** `that match`, `that judge`, `from components from a registry`, `with entire page content`, long comma-separated chains, dropped -s in 3rd person, headline phrase + colon (`The overall result:`).
5. **What was cut during the editing cycle:**
   - em-dashes around parentheticals
   - adjective tricolons (narrow/predictable/typed → narrow/typed)
   - "not X but Y" constructions
   - "Worth saying that", "It's important to note"
   - `200+` → `more than 200` (N+ notation as a marker)
6. **High price per word.** ~290 words in the final version, not a single abstract adjective without a number behind it.

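The before/after figure in item 2 (steps per page that needed at least one retry) can be derived from the per-step logs. A minimal sketch, assuming one log record per generation step; the field names and the `summarizePage` helper are hypothetical and only mirror the four values the answer says each step logs (cache-hit ratio, latency, retry count, UX score).

```typescript
// Hypothetical per-step log record; field names are assumptions, not the real log format.
interface StepLog {
  step: number;
  retryCount: number;
  latencyMs: number;
  cacheHitRatio: number; // 0..1
  uxScore: number;
}

interface PageSummary {
  steps: number;
  stepsWithRetry: number; // the "N of 7" style metric
  totalLatencyMs: number;
  meanCacheHitRatio: number;
  meanUxScore: number;
}

function summarizePage(logs: StepLog[]): PageSummary {
  const steps = logs.length;
  // Mean of a numeric field; 0 for an empty run to avoid dividing by zero.
  const mean = (pick: (l: StepLog) => number): number =>
    steps === 0 ? 0 : logs.reduce((sum, l) => sum + pick(l), 0) / steps;
  return {
    steps,
    // A step counts once, no matter how many retries it needed.
    stepsWithRetry: logs.filter((l) => l.retryCount > 0).length,
    totalLatencyMs: logs.reduce((sum, l) => sum + l.latencyMs, 0),
    meanCacheHitRatio: mean((l) => l.cacheHitRatio),
    meanUxScore: mean((l) => l.uxScore),
  };
}
```

Comparing `stepsWithRetry / steps` and `totalLatencyMs` across runs before and after a tuning change gives the same kind of before/after pair the essay reports.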
## What can be adapted for other applications

- **Core spine (problem → harness → instrumentation → measured change)** carries over unchanged to any AI engineering role with a lean toward production reliability.
- **Claude Code skill as the analysis loop**: a distinctive technique that adds the signal of a mature AI-augmented workflow. Use it when the target responds to AI-augmented dev practices.
- **The numbers from this essay can be repeated literally:** 5/7 → 1-2/7, 5-7min → 2-3min, up to 150K tokens, more than 200 components, AJV with custom fields. These are all verified wordings.