cv-2026/tracking/CLAUDE.md

# tracking/ — Job-search tracking and Telegram vacancy pipeline

This folder is the operational layer of the job search: the curated channel registry, the live cursor for incremental Telegram pulls, the staging area for messages awaiting triage, and the long-form logs of applications and outreach.

If you (Claude) are about to do anything related to "find vacancies in Telegram", "scan job channels", "what's new in Jobs", "triage a new channel", or similar — this is the file to read first. The main `CLAUDE.md` references it from the Telegram workflow section.

---

## Files at a glance

| File | Purpose | In git? |
|---|---|---|
| `telegram_channels.json` | **Curated source of truth** — per-channel `lang`, `priority`, and filter (`include`/`exclude`). Tunable by hand. | ✅ committed |
| `telegram_state.json` | Per-machine cursor — `last_message_id` and `last_seen_date` per channel. Regenerated automatically. | ❌ gitignored |
| `telegram_inbox.json` | Output of the last fetch run — kept messages only, per channel, with `lang`/`priority` injected. Overwritten each run. | ❌ gitignored |
| `telegram_pending_channels.json` | Generated only when the last run had **new** (untriaged) channels: a keyword-frequency scan to bootstrap their curation. Deleted on the next run if no channels are pending. | ❌ gitignored |
| `applications.md` | One row per application — manually maintained, append-only. | ✅ committed |
| `outreach.md` | Cold messages, recruiter pings, follow-ups. One row per touchpoint. | ✅ committed |

---

## Running the pipeline

Two scripts, chainable. Always run from project root.

```bash
~/.local/bin/uv run scripts/list_telegram_channels.py \
  | ~/.local/bin/uv run scripts/fetch_telegram_jobs.py -
```

**Step 1 — `scripts/list_telegram_channels.py`**: reads the live "Jobs" folder from Telegram via Telethon and emits a JSON array of channel usernames (or numeric ids for private channels) to stdout. Always run fresh — Oleg curates the folder manually and adds new channels regularly.

**Step 2 — `scripts/fetch_telegram_jobs.py`**: pulls new messages per channel, applies the per-channel filter, and writes results to `telegram_inbox.json`. Accepts channels as positional args or as a JSON array on stdin (`-`).

### Constants in the fetch script

- `DEFAULT_LOOKBACK_DAYS = 30` — first-time lookback window for new channels (no cursor yet).
- `MAX_PER_CHANNEL = 500` — hard cap on raw messages fetched per channel per run. A channel that posts >500 messages in the lookback window gets `truncated: true` in the output and we silently miss the tail. Tune per scenario (see "Truncation" below).

### Trigger

Vacancy scans run **only when Oleg explicitly asks** (e.g. "забери свежее из Jobs", i.e. "grab what's fresh from Jobs", or "что нового в каналах", i.e. "what's new in the channels"). No background polling.

---

## telegram_channels.json — schema

Each entry is keyed by `username` (or numeric id for private channels) and is an object:

```jsonc
{
  "<channel_id>": {
    "lang": "ru" | "en" | "...",        // required
    "priority": "p1" | "p2" | "p3",     // required
    "include": <filter_form>,           // optional — absent = trust-all (no positive constraint)
    "exclude": ["kw1", "kw2", ...]      // optional — absent = no negative constraint
  }
}
```
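Since the registry is hand-edited, a quick structural check can catch typos before a run. The sketch below is not part of the existing scripts; `validate_entry` and `ALLOWED_PRIORITIES` are hypothetical names, and the rules only mirror the schema comment above.

```python
ALLOWED_PRIORITIES = ("p1", "p2", "p3")


def validate_entry(name: str, entry: dict) -> list[str]:
    """Return a list of human-readable problems for one registry entry."""
    problems = []
    # `lang` is required; any non-empty short code is accepted.
    if not isinstance(entry.get("lang"), str) or not entry["lang"]:
        problems.append(f"{name}: 'lang' is required and must be a non-empty string")
    # `priority` is required and limited to the three documented levels.
    if entry.get("priority") not in ALLOWED_PRIORITIES:
        problems.append(f"{name}: 'priority' must be one of p1/p2/p3")
    # `include` items are either keywords or lists of keywords (OR-groups).
    for item in entry.get("include", []):
        is_group = isinstance(item, list) and all(isinstance(k, str) for k in item)
        if not (isinstance(item, str) or is_group):
            problems.append(f"{name}: bad 'include' item {item!r}")
    # `exclude` is always a flat list of keywords.
    if not all(isinstance(k, str) for k in entry.get("exclude", [])):
        problems.append(f"{name}: 'exclude' must be a flat list of strings")
    return problems


print(validate_entry("some_channel", {"lang": "ru", "priority": "p1"}))  # []
```

An empty list means the entry looks structurally fine; it says nothing about whether the keywords are well chosen.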

A message **passes the filter** when:
1. **No** `exclude` keyword (case-insensitive substring) is present, AND
2. Every `include` OR-group contributes at least one match.

If both `include` and `exclude` are absent → **trust-all** (every message passes; useful for low-volume personal/digest channels).

### `include` — the four forms

| Form | Semantics | Example |
|---|---|---|
| `[]` or absent | no positive constraint (trust-all when `exclude` is also absent) | _(no constraint)_ |
| `["a", "b"]` | flat OR — at least one matches | `["javascript", "react"]` |
| `[["a", "b"], ["c", "d"]]` | AND of OR-groups — every group needs ≥1 hit | `[["#vacancy","#вакансия"], ["#remote","#удаленка"]]` |
| `[["a","b"], "c"]` | scalars auto-promoted to single-item groups | same as `[["a","b"], ["c"]]` |
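The pass rule and the four `include` forms can be sketched as below. This is an illustration of the documented semantics, not the code in `scripts/fetch_telegram_jobs.py`; `normalize_include` and `passes_filter` are hypothetical names.

```python
from typing import Optional, Sequence, Union

IncludeForm = Sequence[Union[str, Sequence[str]]]


def normalize_include(include: Optional[IncludeForm]) -> list:
    """Turn any of the four `include` forms into a list of OR-groups."""
    if not include:
        return []  # absent or [] -> no positive constraint
    if all(isinstance(item, str) for item in include):
        return [[kw.lower() for kw in include]]  # flat list -> one OR-group
    groups = []
    for item in include:
        # Scalars next to nested groups are promoted to single-item groups.
        group = [item] if isinstance(item, str) else list(item)
        groups.append([kw.lower() for kw in group])
    return groups


def passes_filter(text: str, include=None, exclude=None) -> bool:
    haystack = text.lower()
    # Any exclude hit rejects the message, regardless of include.
    if any(kw.lower() in haystack for kw in (exclude or [])):
        return False
    # Every OR-group must contribute at least one hit.
    return all(
        any(kw in haystack for kw in group)
        for group in normalize_include(include)
    )


tags = [["#vacancy", "#вакансия"], ["#remote", "#удаленка"]]
print(passes_filter("React dev #vacancy #remote", tags))             # True
print(passes_filter("React dev #vacancy, office only", tags))        # False
print(passes_filter("React + Kafka #vacancy #remote", tags, ["kafka"]))  # False
```

With both arguments omitted, `passes_filter` returns `True` for any text, which is the trust-all case.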

### `exclude` — flat list

If **any** keyword in `exclude` appears in the text → the message is **rejected**, even if `include` would have matched. Used to drop wrong-stack postings from generic channels.

Standard Oleg-stack excludes for jobs feeds:
```json
["kafka", "golang", "kotlin", "android", "swift"]
```

For `*_jobs` channels with hashtag-based filters, add resume excludes too:
```json
["kafka", "golang", "kotlin", "android", "swift", "#резюме", "#resume", "#cv", "#ищуработу"]
```

### Pitfalls

- **Case-insensitive substring matching**, no word boundaries. `"go"` matches "going" / "Goldbelt" / "google", which is why we use `"golang"` instead. Same trap for `"java"` (matches "javascript"); use `" java "` with spaces, or `"#java "` for the hashtag form. Pad other short keywords the same way: `" rust "`, `" ios "`. Padding trades false positives for misses: `" java "` will not match "Java" at the very start or end of the text, or next to punctuation ("Java,").
- **`react native`** in `exclude` would also reject fullstack postings that merely mention React Native alongside web React. Prefer excluding `kotlin`/`android`/`swift`/`flutter` to block mobile, and block React Native only when the channel is mobile-only.
- The same keyword can appear in `include` for one channel and `exclude` for another — they're per-channel, independent.
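The substring pitfalls are easy to reproduce. A minimal sketch, assuming the matcher is a plain lowercase `in` test as described above (`hit` is a hypothetical helper, not a function from the fetch script):

```python
def hit(keyword: str, text: str) -> bool:
    """Case-insensitive substring test with no word boundaries."""
    return keyword.lower() in text.lower()


print(hit("go", "We use Google Cloud"))        # True: "go" hides inside "google"
print(hit("golang", "We use Google Cloud"))    # False
print(hit("java", "Senior JavaScript dev"))    # True: "java" hides inside "javascript"
print(hit(" java ", "Senior JavaScript dev"))  # False: padding blocks the false positive
print(hit(" java ", "Stack: Java"))            # False: no trailing space at end of text
print(hit(" java ", "Stack: Java, Spring"))    # False: comma instead of a space
```

The last two lines show the cost of padding: real Java postings can slip past a padded exclude, so sample `kept` messages after adding one.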

---

## Priority levels

Set on every channel. Assignment is judged **by the best vacancy seen in a fresh fetch** for that channel — not by volume or hashtag density.

| Level | Meaning | Triage attention |
|---|---|---|
| **p1** | Very relevant — strong stack hits **and** global-remote culture. Posts that Oleg would actually apply to. | Read every kept message. |
| **p2** | Stack OK but the roles target the internal market (Russian RUB/CIS-only roles), or culture OK but the salary band typically misses Oleg's threshold (US-only with low pay, Netherlands on-site). Worth periodic scanning for occasional gems. Market-intel channels (recruiter content) live here too. | Skim, dive into interesting headlines. |
| **p3** | Wrong stack (mobile-native, devops, QA), off-market (Nigeria with ₦ salaries, Netherlands junior on-site), pure chat/noise, founder lifestyle blogs, or dead channels (0 messages in lookback). Subscribed for completeness — Oleg may pivot or want occasional glance. | Glance only on request. |

When triaging the inbox, sort/group by `priority` first, then by `lang`.
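One way to get that ordering in Python, as a sketch: `triage_order` is a hypothetical helper, and putting pending channels (where `priority` and `lang` are `null`) last is an assumption, not something the pipeline prescribes.

```python
def triage_order(channels: dict) -> list:
    """Sort channel entries by priority, then lang, then name."""
    def key(item):
        name, info = item
        # "p9" / "zz" are placeholder sentinels that push pending channels to the end.
        return (info.get("priority") or "p9", info.get("lang") or "zz", name)
    return sorted(channels.items(), key=key)


inbox_channels = {
    "chan_b": {"priority": "p2", "lang": "ru", "kept": 3},
    "chan_a": {"priority": "p1", "lang": "ru", "kept": 1},
    "chan_c": {"priority": "p1", "lang": "en", "kept": 2},
    "chan_new": {"priority": None, "lang": None, "kept": 7},
}
print([name for name, _ in triage_order(inbox_channels)])
# ['chan_c', 'chan_a', 'chan_b', 'chan_new']
```

Pass it the `channels` object loaded from `tracking/telegram_inbox.json`; the channel names above are made up.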

---

## Language codes

Free-form short ISO-style codes — pick what fits:
- `ru` — Russian (most curated channels)
- `en` — English
- `mixed` — multi-language channel, when you can't pick a primary
- `nl`, `de`, etc. — for regional boards

This isn't strict; it's a hint for triage attention (Oleg reads ru and en fluently; everything else needs translation overhead).

---

## Triaging a new channel — full procedure

A "new" channel = one that's in the Telegram "Jobs" folder but doesn't have an entry in `telegram_channels.json`. Detected automatically: the fetch script puts its raw messages into `telegram_inbox.json` unfiltered and writes a keyword-frequency scan to `telegram_pending_channels.json`.
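The detection itself amounts to a membership check of the live folder list against the registry keys. A minimal sketch, assuming numeric ids are compared as strings (`split_channels` is a hypothetical name; the fetch script may do this differently):

```python
def split_channels(folder: list, registry: dict) -> tuple:
    """Split the live folder list into curated and pending channels."""
    # JSON object keys are strings, so numeric ids are stringified first (assumption).
    live = [str(c) for c in folder]
    curated = [c for c in live if c in registry]
    pending = [c for c in live if c not in registry]
    return curated, pending


folder = ["a_jobs", 123, "new_chan"]          # made-up output of the list script
registry = {"a_jobs": {}, "123": {}}          # made-up telegram_channels.json
print(split_channels(folder, registry))
# (['a_jobs', '123'], ['new_chan'])
```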

Steps to graduate a channel out of pending:

1. **Read `telegram_pending_channels.json`** — for each new channel:
   - `keyword_counts_from_other_channels`: how often every existing keyword (include + exclude across all channels) appears in this channel's recent messages. Quick signal of stack and posting style.
   - `messages_scanned`, `first_run`, `truncated`: volume context.
2. **Open `telegram_inbox.json`** and sample 3–8 messages from this channel directly:
   ```bash
   jq -r '.channels["<channel>"].messages[:5] | .[] | "── \(.date[0:16])\n\(.text[0:400])\n"' tracking/telegram_inbox.json
   ```
   Look for: hashtag patterns, language, post structure (single role vs digest vs chat), recurring noise types.
3. **Decide `lang` and `priority`** using the rubrics above. Base priority on the **best** vacancy in the sample, not the average.
4. **Decide filter shape:**
   - Channel posts proper `#vacancy`/`#вакансия` + `#remote`/`#удаленка` tags → use the standard hashtag AND-of-OR + Oleg-stack excludes (most `*_jobs` channels).
   - Channel posts vacancy text without consistent hashtags → use **positive stack include** (`["javascript", "typescript", "react", ...]`) + the same Oleg-stack excludes.
   - Channel is low-volume personal/curated content (recruiter musings, market intel) where the value is the whole post → **trust-all** (omit `include` and `exclude`).
   - Channel is a digest that mixes resumes and vacancies (e.g. `javascript_jobs_feed`) → trust-all is usually the right call; filtering `резюме` would drop the whole digest.
   - Channel is mostly noise/wrong stack but worth keeping subscribed → strict positive filter, accept that most runs will return 0.
5. **Add the entry to `telegram_channels.json`**. JSON is hand-edited; keep entries ordered by `priority` then alphabetically for readability.
6. **Rerun the chain.** The channel transitions out of pending. The `telegram_pending_channels.json` file is automatically deleted when no pending channels remain.
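7. **Validate.** Sample the new `kept` messages and verify that nothing wrong passes and nothing relevant is dropped. If the filter is wrong, edit it and rerun. Incremental fetches filter only messages newer than the cursor, so a rerun with the cursor in place re-checks nothing already seen. To validate the filter against history, clear the state for that channel first: `jq 'del(.["<channel>"])' tracking/telegram_state.json > tracking/telegram_state.json.tmp && mv tracking/telegram_state.json.tmp tracking/telegram_state.json`. jq does not edit files in place, and the bracket form also works for numeric ids.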
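For step 1, the keyword-frequency scan can be approximated as follows. This is a sketch of the idea behind `keyword_counts_from_other_channels`, not the script's actual code; the function names are hypothetical, and counting messages that contain a keyword (rather than total occurrences) is an assumption.

```python
from collections import Counter


def collect_keywords(registry: dict) -> set:
    """Gather every include and exclude keyword across all curated channels."""
    keywords = set()
    for entry in registry.values():
        for item in entry.get("include", []):
            # An include item is either a keyword or an OR-group of keywords.
            keywords.update([item] if isinstance(item, str) else item)
        keywords.update(entry.get("exclude", []))
    return {kw.lower() for kw in keywords}


def keyword_counts(messages: list, keywords: set) -> dict:
    """Count, per keyword, how many messages contain it (substring match)."""
    counts = Counter()
    for text in messages:
        haystack = text.lower()
        for kw in keywords:
            if kw in haystack:
                counts[kw] += 1
    return dict(counts.most_common())


registry = {
    "a": {"include": [["#vacancy", "#вакансия"], "react"], "exclude": ["kafka"]},
    "b": {"include": ["react", "typescript"]},
}
texts = ["React and Kafka", "react only", "nothing relevant"]
print(keyword_counts(texts, collect_keywords(registry)))
# {'react': 2, 'kafka': 1}
```

High counts for the hashtag keywords suggest the standard hashtag filter will work; high counts only for stack keywords point to a positive stack include.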

### Sanity-check existing filters

When tuning, always:
- Sample `kept` messages — are they all valid for Oleg?
- For channels with `kept == 0`, **verify with an unfiltered pull** (temporarily remove the channel's entry and rerun for it alone) that nothing legitimate is being thrown away. Don't assume 0 = correct without checking.
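To find the channels that need this check, something like the sketch below works on the inbox structure documented under "Output of a fetch run". `zero_kept` is a hypothetical helper; it skips pending channels, since those are unfiltered anyway.

```python
def zero_kept(inbox: dict) -> list:
    """Curated channels that fetched messages but kept none of them."""
    return sorted(
        name
        for name, ch in inbox["channels"].items()
        if ch["priority"] is not None and ch["seen"] > 0 and ch["kept"] == 0
    )


sample_inbox = {  # made-up data in the documented shape
    "channels": {
        "busy_jobs": {"priority": "p2", "seen": 500, "kept": 0},
        "good_jobs": {"priority": "p1", "seen": 40, "kept": 5},
        "dead_chan": {"priority": "p3", "seen": 0, "kept": 0},
        "new_chan": {"priority": None, "seen": 12, "kept": 12},
    }
}
print(zero_kept(sample_inbox))  # ['busy_jobs']
```

In practice, load `tracking/telegram_inbox.json` with `json.load` and pass the result in.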

---

## Truncation — when the 500-message cap bites

A channel with `"truncated": true` in `telegram_inbox.json` had >500 raw messages in the lookback window. We see the most-recent 500 and silently miss the tail (older portion of the window).

For `*_jobs` Russian channels truncation typically means we covered 1–10 days of a 30-day window. Strict hashtag filters then leave 1–7 kept messages — but the **missed** older messages could contain relevant vacancies.

Options:
- Bump `MAX_PER_CHANNEL` globally (more API calls, longer run).
- Narrow lookback for the busy channel (no per-channel knob today — would require a code change).
- Making the filter stricter does not help: that would reduce the raw message count only if the filter applied at the API level, and substring filters do not.

For now, keep the cap and accept the tail loss for very busy channels; relax only when a specific channel justifies it.
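To judge whether a truncated channel justifies relaxing the cap, a rough coverage estimate helps. The sketch below measures the span between the oldest and newest message in the inbox for one channel. Because `messages` holds kept messages only, this is a lower bound on what the raw fetch covered. `covered_days` is a hypothetical helper, and it assumes `date` is an ISO 8601 string.

```python
from datetime import datetime


def covered_days(channel: dict) -> float:
    """Days between the oldest and newest kept message of one channel."""
    dates = [
        # Older Python versions reject a trailing "Z", so normalize it first.
        datetime.fromisoformat(m["date"].replace("Z", "+00:00"))
        for m in channel["messages"]
    ]
    if len(dates) < 2:
        return 0.0  # not enough messages to measure a span
    return (max(dates) - min(dates)).total_seconds() / 86400


channel = {  # made-up entry in the documented shape
    "truncated": True,
    "messages": [
        {"date": "2026-06-01T00:00:00+00:00"},
        {"date": "2026-06-04T12:00:00+00:00"},
    ],
}
print(covered_days(channel))  # 3.5
```

A truncated channel whose kept messages span only a few days of the 30-day window is a candidate for a higher `MAX_PER_CHANNEL`.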

---

## Output of a fetch run

`telegram_inbox.json` structure (overwritten each run):

```jsonc
{
  "generated_at": "2026-06-02T...",
  "lookback_days_for_new_channels": 30,
  "total_in_inbox": <int>,
  "channels": {
    "<channel>": {
      "lang": "ru" | "en" | null,         // null = channel is still "new" / pending
      "priority": "p1" | "p2" | "p3" | null,
      "seen": <int>,                      // raw messages fetched
      "kept": <int>,                      // after filter
      "filtered_out": <int>,
      "first_run": <bool>,                // no prior state cursor
      "truncated": <bool>,                // hit MAX_PER_CHANNEL
      "filter_mode": "filtered (...)" | "trust-all (no filter)" | "unfiltered (new channel — not yet curated)",
      "messages": [
        { "id": <int>, "date": "<ISO>", "text": "...", "has_media": <bool>, "link": "https://t.me/.../id" }
      ]
    }
  }
}
```

Messages are **chronological per channel** (oldest first within each channel).

### Useful jq probes

```bash
# Per-channel summary sorted by kept desc
jq -r '.channels | to_entries | sort_by(.value.kept) | reverse | .[]
  | "\(.key) → kept \(.value.kept)/\(.value.seen) [\(.value.priority // "—")/\(.value.lang // "—")]"' \
  tracking/telegram_inbox.json

# All p1 kept messages
jq '.channels | to_entries | map(select(.value.priority == "p1")) | from_entries' \
  tracking/telegram_inbox.json

# Truncated channels with kept/seen counts and priority
jq -r '.channels | to_entries | map(select(.value.truncated))
  | .[] | "\(.key): kept \(.value.kept)/\(.value.seen), priority \(.value.priority)"' \
  tracking/telegram_inbox.json
```

---

## After triage

Promising postings → append a row to `applications.md`. Don't accumulate a "seen but skipped" log — the state cursor already prevents re-reading.

For outreach (cold DMs, recruiter conversations) → `outreach.md`, one row per touchpoint.

If Oleg unsubscribes from a channel in Telegram, it disappears from the live folder list, the next run won't fetch it, and its entry in `telegram_channels.json` becomes dead weight. Periodic cleanup is fine but not required — dead entries cost ~150 bytes.