# tracking/ — Job-search tracking and Telegram vacancy pipeline
This folder is the operational layer of the job search: the curated channel registry, the live cursor for incremental Telegram pulls, the staging area for messages awaiting triage, and the long-form logs of applications and outreach.

If you (Claude) are about to do anything related to "find vacancies in Telegram", "scan job channels", "what's new in Jobs", "triage a new channel", or similar, this is the file to read first. The main `CLAUDE.md` references it from the Telegram workflow section.

---
## Files at a glance
| File | Purpose | In git? |
|---|---|---|
| `telegram_channels.json` | **Curated source of truth** — per-channel `lang`, `priority`, and filter (`include`/`exclude`). Tunable by hand. | ✅ committed |
| `telegram_state.json` | Per-machine cursor — `last_message_id` and `last_seen_date` per channel. Regenerated automatically. | ❌ gitignored |
| `telegram_inbox.json` | Output of the last fetch run — kept messages only, per channel, with `lang`/`priority` injected. Overwritten each run. | ❌ gitignored |
| `telegram_pending_channels.json` | Generated only when the last run had **new** (untriaged) channels: a keyword-frequency scan to bootstrap their curation. Deleted on the next run if no channels are pending. | ❌ gitignored |
| `applications.md` | One row per application — manually maintained, append-only. | ✅ committed |
| `outreach.md` | Cold messages, recruiter pings, follow-ups. One row per touchpoint. | ✅ committed |
---
## Running the pipeline
Two scripts, chainable. Always run from project root.
```bash
~/.local/bin/uv run scripts/list_telegram_channels.py \
| ~/.local/bin/uv run scripts/fetch_telegram_jobs.py -
```
**Step 1 (`scripts/list_telegram_channels.py`)**: reads the live "Jobs" folder from Telegram via Telethon and emits a JSON array of channel usernames (or numeric ids for private channels) to stdout. Always run it fresh, because Oleg curates the folder manually and adds new channels regularly.

**Step 2 (`scripts/fetch_telegram_jobs.py`)**: pulls new messages per channel, applies the per-channel filter, and writes results to `telegram_inbox.json`. Accepts channels as positional args or as a JSON array on stdin (`-`).
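The two input modes of step 2 can be pictured with a minimal sketch. The helper name `read_channels` and the choice to normalise numeric ids to strings are assumptions for illustration; the real script's argument handling may differ.

```python
import json
import sys
from typing import Sequence, TextIO


def read_channels(argv: Sequence[str], stdin: TextIO = sys.stdin) -> list[str]:
    """Return channel ids from positional args, or from a JSON array on stdin when argv is ["-"]."""
    if list(argv) == ["-"]:
        data = json.load(stdin)
        if not isinstance(data, list):
            raise ValueError("expected a JSON array of channel ids on stdin")
        # Assumption: numeric ids of private channels may arrive as ints; keep everything as str.
        return [str(item) for item in data]
    return [str(arg) for arg in argv]
```

In a script this would be called as `read_channels(sys.argv[1:])`, so the piped form and the positional form share one code path.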
### Constants in the fetch script
- `DEFAULT_LOOKBACK_DAYS = 30` — first-time lookback window for new channels (no cursor yet).
- `MAX_PER_CHANNEL = 500` — hard cap on raw messages fetched per channel per run. A channel that posts >500 messages in the lookback window gets `truncated: true` in the output and we silently miss the tail. Tune per scenario (see "Truncation" below).
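
How the two constants interact can be sketched as follows. Only the constant names and values come from this file; the helper names and the exact truncation rule (flag when more than the cap is available, keep the newest) are assumptions, not the script's actual code.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_LOOKBACK_DAYS = 30
MAX_PER_CHANNEL = 500


def first_run_cutoff(now: datetime, lookback_days: int = DEFAULT_LOOKBACK_DAYS) -> datetime:
    """Oldest timestamp to fetch for a channel that has no cursor yet."""
    return now - timedelta(days=lookback_days)


def cap_messages(messages: list[dict], limit: int = MAX_PER_CHANNEL) -> tuple[list[dict], bool]:
    """Keep the newest `limit` messages (input sorted oldest-first) and report truncation."""
    truncated = len(messages) > limit
    return (messages[-limit:] if truncated else messages), truncated


if __name__ == "__main__":
    print(first_run_cutoff(datetime.now(timezone.utc)).isoformat())
```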
### Trigger
Vacancy scans run **only when Oleg explicitly asks** (e.g. "забери свежее из Jobs", "что нового в каналах"). No background polling.

---
## telegram_channels.json — schema
Each entry is keyed by `username` (or numeric id for private channels) and is an object:
```jsonc
{
  "<channel_id>": {
    "lang": "ru" | "en" | "...",       // required
    "priority": "p1" | "p2" | "p3",    // required
    "include": <filter_form>,          // optional; absent = trust-all (no positive constraint)
    "exclude": ["kw1", "kw2", ...]     // optional; absent = no negative constraint
  }
}
```
A message **passes the filter** when:
1. **No** `exclude` keyword (case-insensitive substring) is present, AND
2. Every `include` OR-group contributes at least one match.

If both `include` and `exclude` are absent → **trust-all** (every message passes; useful for low-volume personal/digest channels).
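The pass rule above can be written as a small predicate. This is a sketch under stated assumptions: `passes_filter` is a hypothetical name, and `include_groups` is assumed to be already expressed as a list of OR-groups.

```python
def passes_filter(text: str, include_groups: list[list[str]], exclude: list[str]) -> bool:
    """Return True when no exclude keyword matches and every include OR-group has a hit."""
    haystack = text.lower()
    # Rule 1: any exclude keyword rejects the message outright.
    if any(keyword.lower() in haystack for keyword in exclude):
        return False
    # Rule 2: AND over groups, OR within a group. No groups means trust-all.
    return all(
        any(keyword.lower() in haystack for keyword in group)
        for group in include_groups
    )
```

With empty `include_groups` and empty `exclude`, `all()` over nothing is true, which gives the trust-all behaviour for free.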
### `include` — the four forms
| Form | Semantics | Example |
|---|---|---|
| `[]` or absent | trust-all | _(no constraint)_ |
| `["a", "b"]` | flat OR — at least one matches | `["javascript", "react"]` |
| `[["a", "b"], ["c", "d"]]` | AND of OR-groups — every group needs ≥1 hit | `[["#vacancy","#вакансия"], ["#remote","#удаленка"]]` |
| `[["a","b"], "c"]` | scalars auto-promoted to single-item groups | same as `[["a","b"], ["c"]]` |
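
The four forms in the table reduce to one shape, a list of OR-groups. A minimal sketch of that normalisation, with `normalize_include` as a hypothetical helper name:

```python
from typing import Optional, Union

IncludeForm = list[Union[str, list[str]]]


def normalize_include(include: Optional[IncludeForm]) -> list[list[str]]:
    """Convert any of the four `include` forms into a list of OR-groups."""
    if not include:
        return []  # absent or []: trust-all
    if all(isinstance(item, str) for item in include):
        return [list(include)]  # flat list: a single OR-group
    # Mixed or nested: promote bare strings to single-item groups.
    return [[item] if isinstance(item, str) else list(item) for item in include]
```

For example, `normalize_include([["a", "b"], "c"])` yields `[["a", "b"], ["c"]]`, matching the last row of the table.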
### `exclude` — flat list
If **any** keyword in `exclude` appears in the text → the message is **rejected**, even if `include` would have matched. Used to drop wrong-stack postings from generic channels.
Standard Oleg-stack excludes for jobs feeds:
```json
["kafka", "golang", "kotlin", "android", "swift"]
```
For `*_jobs` channels with hashtag-based filters, add resume excludes too:
```json
["kafka", "golang", "kotlin", "android", "swift", "#резюме", "#resume", "#cv", "#ищуработу"]
```
### Pitfalls
- **Case-insensitive substring matching**, no word boundaries. `"go"` matches "going" / "Goldbelt" / "google", which is why we use `"golang"` instead. Same trap for `"java"` (matches "javascript"); use `" java "` with spaces, or `"#java "` for the hashtag form. Pad other short keywords the same way: `" rust "`, `" ios "`. A padded keyword misses the word at the very start or end of the text or next to punctuation (`"Java,"`), so treat padding as a trade-off.
- **`react native`** in `exclude` would also block `"react native"` mentions in fullstack postings. Prefer excluding `kotlin`/`android`/`swift`/`flutter` to block mobile, and only block React Native when the channel is mobile-only.
- The same keyword can appear in `include` for one channel and `exclude` for another — they're per-channel, independent.
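
The substring trap is easy to reproduce. The sketch below assumes the filter behaves like Python's `in` on lower-cased text; `substring_hit` and the sample sentences are made up for illustration.

```python
def substring_hit(keyword: str, text: str) -> bool:
    """Case-insensitive substring test with no word boundaries."""
    return keyword.lower() in text.lower()


checks = [
    ("go", "We deploy on Google Cloud"),        # hit: false positive inside "Google"
    ("golang", "We deploy on Google Cloud"),    # no hit
    ("java", "Senior JavaScript developer"),    # hit: false positive inside "JavaScript"
    (" java ", "Senior JavaScript developer"),  # no hit: padding avoids the false positive
    (" java ", "Backend on Java and Spring"),   # hit
    (" java ", "Stack: Java, Spring"),          # no hit: padding misses "Java,"
]

for keyword, text in checks:
    print(f"{keyword!r:10} in {text!r}: {substring_hit(keyword, text)}")
```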
---
## Priority levels
Set on every channel. Assignment is judged **by the best vacancy seen in a fresh fetch** for that channel — not by volume or hashtag density.
| Level | Meaning | Triage attention |
|---|---|---|
| **p1** | Very relevant — strong stack hits **and** global-remote culture. Posts that Oleg would actually apply to. | Read every kept message. |
| **p2** | Stack OK but culture is internal market (Russian RUB/CIS-only roles), or culture OK but salary band typically misses Oleg's threshold (US-only with low pay, Netherlands on-site). Worth periodic scanning — occasional gems. Market-intel channels (recruiter content) live here too. | Skim, dive into interesting headlines. |
| **p3** | Wrong stack (mobile-native, devops, QA), off-market (Nigeria with ₦ salaries, Netherlands junior on-site), pure chat/noise, founder lifestyle blogs, or dead channels (0 messages in lookback). Subscribed for completeness, since Oleg may pivot or want an occasional glance. | Glance only on request. |
When triaging the inbox, sort/group by `priority` first, then by `lang`.
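A minimal sketch of that ordering over the `channels` object of `telegram_inbox.json`. The function name `triage_order` is hypothetical, and placing pending channels (null `priority`/`lang`) last is an assumption.

```python
def triage_order(channels: dict[str, dict]) -> list[str]:
    """Return channel names sorted by priority, then lang, then name."""

    def sort_key(name: str) -> tuple[str, str, str]:
        entry = channels[name]
        # Assumption: pending channels have null priority/lang and should sort last.
        return (entry.get("priority") or "p9", entry.get("lang") or "zz", name)

    return sorted(channels, key=sort_key)
```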
---
## Language codes
Free-form short ISO-style codes; pick what fits:
- `ru`: Russian (most curated channels)
- `en`: English
- `mixed`: multi-language channel, when you can't pick a primary
- `nl`, `de`, etc.: for regional boards

This isn't strict; it's a hint for triage attention (Oleg reads ru and en fluently; everything else needs translation overhead).

---
## Triaging a new channel — full procedure
A "new" channel = one that's in the Telegram "Jobs" folder but doesn't have an entry in `telegram_channels.json`. Detected automatically: the fetch script puts its raw messages into `telegram_inbox.json` unfiltered and writes a keyword-frequency scan to `telegram_pending_channels.json`.
Steps to graduate a channel out of pending:
1. **Read `telegram_pending_channels.json`.** For each new channel:
   - `keyword_counts_from_other_channels`: how often every existing keyword (include + exclude across all channels) appears in this channel's recent messages. Quick signal of stack and posting style.
   - `messages_scanned`, `first_run`, `truncated`: volume context.
2. **Open `telegram_inbox.json`** and sample 3–8 messages from this channel directly:
   ```bash
   jq -r '.channels["<channel>"].messages[:5] | .[] | "── \(.date[0:16])\n\(.text[0:400])\n"' tracking/telegram_inbox.json
   ```
   Look for: hashtag patterns, language, post structure (single role vs digest vs chat), recurring noise types.
3. **Decide `lang` and `priority`** using the rubrics above. Base priority on the **best** vacancy in the sample, not the average.
4. **Decide filter shape:**
   - Channel posts proper `#vacancy`/`#вакансия` + `#remote`/`#удаленка` tags → use the standard hashtag AND-of-OR + Oleg-stack excludes (most `*_jobs` channels).
   - Channel posts vacancy text without consistent hashtags → use **positive stack include** (`["javascript", "typescript", "react", ...]`) + the same Oleg-stack excludes.
   - Channel is low-volume personal/curated content (recruiter musings, market intel) where the value is the whole post → **trust-all** (omit `include` and `exclude`).
   - Channel is a digest that mixes resumes and vacancies (e.g. `javascript_jobs_feed`) → trust-all is usually the right call; filtering `резюме` would drop the whole digest.
   - Channel is mostly noise/wrong stack but worth keeping subscribed → strict positive filter, accept that most runs will return 0.
5. **Add the entry to `telegram_channels.json`**. JSON is hand-edited; keep entries ordered by `priority` then alphabetically for readability.
6. **Rerun the chain.** The channel transitions out of pending. The `telegram_pending_channels.json` file is automatically deleted when no pending channels remain.
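7. **Validate.** Sample the new `kept` messages and verify that nothing wrong passes and nothing legitimate is dropped. If the filter is wrong, edit it and rerun. Keeping the state cursor is fine for that, but incremental fetches re-filter only new messages, so to validate the filter on history, clear the state for that channel first. `jq` does not edit files in place, so write the result to a temporary file and move it back: `jq 'del(.["<channel>"])' tracking/telegram_state.json > tracking/telegram_state.tmp && mv tracking/telegram_state.tmp tracking/telegram_state.json`.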
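
The keyword-frequency scan read in step 1 can be approximated as below. This is a sketch, not the fetch script's code: the helper names are hypothetical, and counting each keyword once per message (rather than every occurrence) is an assumption.

```python
from collections import Counter


def collect_keywords(channels: dict[str, dict]) -> set[str]:
    """Gather every include and exclude keyword across all curated channels."""
    keywords: set[str] = set()
    for entry in channels.values():
        for item in entry.get("include", []):
            # include items may be bare strings or OR-groups
            keywords.update([item] if isinstance(item, str) else item)
        keywords.update(entry.get("exclude", []))
    return {keyword.lower() for keyword in keywords}


def keyword_counts(texts: list[str], keywords: set[str]) -> Counter:
    """Count, per keyword, how many messages contain it as a substring."""
    counts: Counter = Counter()
    for text in texts:
        haystack = text.lower()
        counts.update(keyword for keyword in keywords if keyword in haystack)
    return counts
```

Calling `keyword_counts(texts, keywords).most_common(10)` on a new channel's sample gives a quick read on which existing include or exclude terms dominate it.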
### Sanity-check existing filters
When tuning, always:
- Sample `kept` messages — are they all valid for Oleg?
- For channels with `kept == 0`, **verify with an unfiltered pull** (temporarily remove the channel's entry and rerun for it alone) that nothing legitimate is being thrown away. Don't assume 0 = correct without checking.
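
To find the channels that need this check, a small helper over the parsed `telegram_inbox.json` is enough. The name `zero_kept_channels` is hypothetical, and skipping channels with `seen == 0` (nothing fetched, so nothing could be dropped) is an assumption about what counts as suspicious.

```python
def zero_kept_channels(inbox: dict) -> list[str]:
    """List channels where messages were fetched but the filter kept none."""
    return sorted(
        name
        for name, channel in inbox.get("channels", {}).items()
        if channel.get("kept") == 0 and channel.get("seen", 0) > 0
    )
```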
---
## Truncation — when the 500-message cap bites
A channel with `"truncated": true` in `telegram_inbox.json` had >500 raw messages in the lookback window. We see the most-recent 500 and silently miss the tail (older portion of the window).
For `*_jobs` Russian channels truncation typically means we covered 1–10 days of a 30-day window. Strict hashtag filters then leave 1–7 kept messages, but the **missed** older messages could contain relevant vacancies.
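The coverage arithmetic is simple enough to sketch. `covered_days` is a hypothetical helper and the posting rates in the usage lines are made-up inputs, not measurements from any channel.

```python
def covered_days(posts_per_day: float, max_per_channel: int = 500, lookback_days: int = 30) -> float:
    """Estimate how many days of the lookback window the message cap actually covers."""
    if posts_per_day <= 0:
        return float(lookback_days)
    return min(lookback_days, max_per_channel / posts_per_day)


if __name__ == "__main__":
    for rate in (10, 50, 100):
        print(f"{rate} posts/day covers about {covered_days(rate):.0f} of 30 days")
```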
Options:
- Bump `MAX_PER_CHANNEL` globally (more API calls, longer run).
- Narrow lookback for the busy channel (no per-channel knob today — would require a code change).
- Tune the filter to be stricter so fewer raw messages need processing. This would help only if the filter applied at the API level, and substring filters do not: they run after the raw messages are fetched.

For now, keep the cap and accept the tail loss for very busy channels; relax only when a specific channel justifies it.

---
## Output of a fetch run
`telegram_inbox.json` structure (overwritten each run):
```jsonc
{
  "generated_at": "2026-06-02T...",
  "lookback_days_for_new_channels": 30,
  "total_in_inbox": <int>,
  "channels": {
    "<channel>": {
      "lang": "ru" | "en" | null,              // null = channel is still "new" / pending
      "priority": "p1" | "p2" | "p3" | null,
      "seen": <int>,                           // raw messages fetched
      "kept": <int>,                           // after filter
      "filtered_out": <int>,
      "first_run": <bool>,                     // no prior state cursor
      "truncated": <bool>,                     // hit MAX_PER_CHANNEL
      "filter_mode": "filtered (...)" | "trust-all (no filter)" | "unfiltered (new channel — not yet curated)",
      "messages": [
        { "id": <int>, "date": "<ISO>", "text": "...", "has_media": <bool>, "link": "https://t.me/.../id" }
      ]
    }
  }
}
```
Messages are **chronological per channel** (oldest first within each channel).
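When a single cross-channel timeline is more convenient than per-channel lists, the structure can be flattened as sketched below. `all_messages` is a hypothetical helper; sorting the `date` strings directly assumes they are ISO-8601 timestamps in one consistent format and offset.

```python
def all_messages(inbox: dict) -> list[dict]:
    """Flatten kept messages from every channel into one list, oldest first."""
    rows = []
    for name, channel in inbox.get("channels", {}).items():
        for message in channel.get("messages", []):
            rows.append({"channel": name, **message})
    # Assumption: same-format ISO-8601 strings sort chronologically as plain strings.
    return sorted(rows, key=lambda row: row["date"])
```

Load the file with `json.load(open("tracking/telegram_inbox.json"))` and pass the result in; each row keeps the original message fields plus a `channel` key.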
### Useful jq probes
```bash
# Per-channel summary sorted by kept desc
jq -r '.channels | to_entries | sort_by(.value.kept) | reverse | .[]
| "\(.key) → kept \(.value.kept)/\(.value.seen) [\(.value.priority // "—")/\(.value.lang // "—")]"' \
tracking/telegram_inbox.json
# All p1 kept messages
jq '.channels | to_entries | map(select(.value.priority == "p1")) | from_entries' \
tracking/telegram_inbox.json
# Truncated channels with kept/seen counts
jq -r '.channels | to_entries | map(select(.value.truncated))
| .[] | "\(.key): kept \(.value.kept)/\(.value.seen), priority \(.value.priority)"' \
tracking/telegram_inbox.json
```
---
## After triage
Promising postings → append a row to `applications.md`. Don't accumulate a "seen but skipped" log — the state cursor already prevents re-reading.
For outreach (cold DMs, recruiter conversations) → `outreach.md`, one row per touchpoint.
If Oleg unsubscribes from a channel in Telegram, it disappears from the live folder list, the next run won't fetch it, and its entry in `telegram_channels.json` becomes dead weight. Periodic cleanup is fine but not required — dead entries cost ~150 bytes.
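For the occasional cleanup, dead entries are the curated keys that no longer appear in the live folder list. A minimal sketch, assuming `curated` is the parsed `telegram_channels.json` and `live` is the JSON array emitted by the list script; `dead_entries` is a hypothetical name.

```python
def dead_entries(curated: dict[str, dict], live: list) -> list[str]:
    """Return curated channel keys that are missing from the live folder list."""
    # Numeric ids may arrive as ints; compare everything as strings.
    live_ids = {str(channel) for channel in live}
    return sorted(name for name in curated if name not in live_ids)
```

The result is only a candidate list: review it by hand before deleting anything from `telegram_channels.json`.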