# tracking/ — Job-search tracking and Telegram vacancy pipeline

This folder is the operational layer of the job search: the curated channel registry, the live cursor for incremental Telegram pulls, the staging area for messages awaiting triage, and the long-form logs of applications and outreach.

If you (Claude) are about to do anything related to "find vacancies in Telegram", "scan job channels", "what's new in Jobs", "triage a new channel", or similar — this is the file to read first. The main `CLAUDE.md` references it from the Telegram workflow section.

---

## Files at a glance

| File | Purpose | In git? |
|---|---|---|
| `telegram_channels.json` | **Curated source of truth** — per-channel `lang`, `priority`, and filter (`include`/`exclude`). Tunable by hand. | ✅ committed |
| `telegram_state.json` | Per-machine cursor — `last_message_id` and `last_seen_date` per channel. Regenerated automatically. | ❌ gitignored |
| `telegram_inbox.json` | Output of the last fetch run — kept messages only, per channel, with `lang`/`priority` injected. Overwritten each run. | ❌ gitignored |
| `telegram_pending_channels.json` | Generated only when the last run had **new** (untriaged) channels — keyword-frequency scan to bootstrap their curation. Deleted on the next run if no pending. | ❌ gitignored |
| `applications.md` | **Frozen/legacy** — historical application log. Superseded by Trello (BestJob board) as the source of truth. Kept for backward compatibility; no longer read or written. | ✅ committed (frozen) |
| `outreach.md` | Cold messages, recruiter pings, follow-ups. One row per touchpoint. | ✅ committed |

---

## Running the pipeline

Two scripts, chainable. Always run from project root.

```bash
~/.local/bin/uv run scripts/list_telegram_channels.py \
  | ~/.local/bin/uv run scripts/fetch_telegram_jobs.py -
```

**Step 1 — `scripts/list_telegram_channels.py`**: reads the live "Jobs" folder from Telegram via Telethon and emits a JSON array of channel usernames (or numeric ids for private channels) to stdout. Always run fresh — Oleg curates the folder manually and adds new channels regularly.

**Step 2 — `scripts/fetch_telegram_jobs.py`**: pulls new messages per channel, applies the per-channel filter, and writes results to `telegram_inbox.json`. Accepts channels as positional args or as a JSON array on stdin (`-`).

**Account:** both scripts connect directly via Telethon using `TELEGRAM_SESSION_STRING` from `.env` — that must be the **usulsu** (main) session. The "Jobs" folder lives on that account. Do not put the samuishechka session there.

### Constants in the fetch script

- `DEFAULT_LOOKBACK_DAYS = 30` — first-time lookback window for new channels (no cursor yet).
- `MAX_PER_CHANNEL = 500` — hard cap on raw messages fetched per channel per run. A channel that posts >500 messages in the lookback window gets `truncated: true` in the output and we silently miss the tail. Tune per scenario (see "Truncation" below).

### Trigger

Vacancy scans run **only when Oleg explicitly asks** (e.g. "забери свежее из Jobs" / "grab what's fresh from Jobs", "что нового в каналах" / "what's new in the channels"). No background polling.

---

## telegram_channels.json — schema

Each entry is keyed by `username` (or numeric id for private channels) and is an object:

```jsonc
{
  "": {
    "lang": "ru" | "en" | "...",        // required
    "priority": "p1" | "p2" | "p3",      // required
    "include": ,                          // optional — absent = trust-all (no positive constraint)
    "exclude": ["kw1", "kw2", ...]       // optional — absent = no negative constraint
  }
}
```

A message **passes the filter** when:

1. **No** `exclude` keyword (case-insensitive substring) is present, AND
2. Every `include` OR-group contributes at least one match.

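The two rules above can be sketched in Python as follows. This is a minimal illustration, not the code in `scripts/fetch_telegram_jobs.py`: the function names `normalize_include` and `passes_filter` are hypothetical, and the way flat and nested `include` lists are told apart is an assumption based on the forms documented in this section.

```python
def normalize_include(include):
    """Return `include` as a list of OR-groups (lists of lowercase keywords)."""
    if not include:
        return []  # absent or [] means no positive constraint
    if all(isinstance(item, str) for item in include):
        # Assumption: a flat list of strings is a single OR-group.
        return [[kw.lower() for kw in include]]
    groups = []
    for item in include:
        # Scalars next to nested lists are promoted to single-item groups.
        group = [item] if isinstance(item, str) else item
        groups.append([kw.lower() for kw in group])
    return groups


def passes_filter(text, include=None, exclude=None):
    """Case-insensitive substring filter: exclude wins, then AND of OR-groups."""
    haystack = text.lower()
    if any(kw.lower() in haystack for kw in exclude or []):
        return False
    # all() over an empty list of groups is True, which gives trust-all.
    return all(
        any(kw in haystack for kw in group)
        for group in normalize_include(include)
    )
```

A helper like this is handy for dry-running a candidate filter against sampled message texts before committing it to `telegram_channels.json`.
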
If both `include` and `exclude` are absent → **trust-all** (every message passes; useful for low-volume personal/digest channels).

### `include` — the four forms

| Form | Semantics | Example |
|---|---|---|
| `[]` or absent | trust-all | _(no constraint)_ |
| `["a", "b"]` | flat OR — at least one matches | `["javascript", "react"]` |
| `[["a", "b"], ["c", "d"]]` | AND of OR-groups — every group needs ≥1 hit | `[["#vacancy","#вакансия"], ["#remote","#удаленка"]]` |
| `[["a","b"], "c"]` | scalars auto-promoted to single-item groups | same as `[["a","b"], ["c"]]` |

### `exclude` — flat list

If **any** keyword in `exclude` appears in the text → the message is **rejected**, even if `include` would have matched. Used to drop wrong-stack postings from generic channels.

Standard Oleg-stack excludes for jobs feeds:

```json
["kafka", "golang", "kotlin", "android", "swift"]
```

For *_jobs channels with hashtag-based filters, add resume excludes too:

```json
["kafka", "golang", "kotlin", "android", "swift", "#резюме", "#resume", "#cv", "#ищуработу"]
```

### Pitfalls

- **Case-insensitive substring matching**, no word boundaries. `"go"` matches "going" / "Goldbelt" / "google" — that's why we use `"golang"` instead. Same trap for `"java"` (matches "javascript"); use `" java "` with spaces, or `"#java "` for hashtag form. For other short keywords that are easily embedded in longer words, pad with spaces too: `" rust "`, `" ios "`.
- **`react native`** in `exclude` would also block `"react native"` mentions in fullstack postings. Prefer excluding `kotlin`/`android`/`swift`/`flutter` to block mobile, and only block React Native when the channel is mobile-only.
- The same keyword can appear in `include` for one channel and `exclude` for another — they're per-channel, independent.

---

## Priority levels

Set on every channel. Assignment is judged **by the best vacancy seen in a fresh fetch** for that channel — not by volume or hashtag density.

| Level | Meaning | Triage attention |
|---|---|---|
| **p1** | Very relevant — strong stack hits **and** global-remote culture. Posts that Oleg would actually apply to. | Read every kept message. |
| **p2** | Stack OK but culture is internal market (Russian RUB/CIS-only roles), or culture OK but salary band typically misses Oleg's threshold (US-only with low pay, Netherlands on-site). Worth periodic scanning — occasional gems. Market-intel channels (recruiter content) live here too. | Skim, dive into interesting headlines. |
| **p3** | Wrong stack (mobile-native, devops, QA), off-market (Nigeria with ₦ salaries, Netherlands junior on-site), pure chat/noise, founder lifestyle blogs, or dead channels (0 messages in lookback). Subscribed for completeness — Oleg may pivot or want occasional glance. | Glance only on request. |

When triaging the inbox, sort/group by `priority` first, then by `lang`.

---

## Language codes

Free-form short ISO-style codes — pick what fits:

- `ru` — Russian (most curated channels)
- `en` — English
- `mixed` — multi-language channel, when you can't pick a primary
- `nl`, `de`, etc. — for regional boards

This isn't strict; it's a hint for triage attention (Oleg reads ru and en fluently; everything else needs translation overhead).

---

## Triaging a new channel — full procedure

A "new" channel = one that's in the Telegram "Jobs" folder but doesn't have an entry in `telegram_channels.json`. Detected automatically: the fetch script puts its raw messages into `telegram_inbox.json` unfiltered and writes a keyword-frequency scan to `telegram_pending_channels.json`.

Steps to graduate a channel out of pending:

1. **Read `telegram_pending_channels.json`** — for each new channel:
   - `keyword_counts_from_other_channels`: how often every existing keyword (include + exclude across all channels) appears in this channel's recent messages. Quick signal of stack and posting style.
   - `messages_scanned`, `first_run`, `truncated`: volume context.
2. **Open `telegram_inbox.json`** and sample 3–8 messages from this channel directly:
   ```bash
   jq -r '.channels[""].messages[:5] | .[] | "── \(.date[0:16])\n\(.text[0:400])\n"' tracking/telegram_inbox.json
   ```
   Look for: hashtag patterns, language, post structure (single role vs digest vs chat), recurring noise types.
3. **Decide `lang` and `priority`** using the rubrics above. Base priority on the **best** vacancy in the sample, not the average.
4. **Decide filter shape:**
   - Channel posts proper `#vacancy`/`#вакансия` + `#remote`/`#удаленка` tags → use the standard hashtag AND-of-OR + Oleg-stack excludes (most *_jobs channels).
   - Channel posts vacancy text without consistent hashtags → use **positive stack include** (`["javascript", "typescript", "react", ...]`) + the same Oleg-stack excludes.
   - Channel is low-volume personal/curated content (recruiter musings, market intel) where the value is the whole post → **trust-all** (omit `include` and `exclude`).
   - Channel is a digest that mixes resumes and vacancies (e.g. `javascript_jobs_feed`) → trust-all is usually the right call; filtering `резюме` would drop the whole digest.
   - Channel is mostly noise/wrong stack but worth keeping subscribed → strict positive filter, accept that most runs will return 0.
5. **Add the entry to `telegram_channels.json`**. JSON is hand-edited; keep entries ordered by `priority` then alphabetically for readability.
6. **Rerun the chain.** The channel transitions out of pending. The `telegram_pending_channels.json` file is automatically deleted when no pending channels remain.
7. **Validate** — sample the new `kept` messages and verify nothing wrong is passing or being dropped. If the filter is wrong, edit and rerun (state cursor is fine to keep — incremental fetches re-filter only new messages, so to validate the filter on history you may want to clear state for that channel: `jq 'del(.)' tracking/telegram_state.json`).

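Clearing one channel's cursor, as mentioned in step 7, can also be done in place with a few lines of Python. Note that `jq` on its own prints the modified JSON to stdout and leaves the file unchanged. The sketch below assumes `telegram_state.json` is a JSON object keyed by channel at the top level; the function name `clear_channel_cursor` is hypothetical and not part of the existing scripts.

```python
import json
from pathlib import Path


def clear_channel_cursor(state_path, channel):
    """Remove one channel's cursor from the state file; return True if removed."""
    path = Path(state_path)
    state = json.loads(path.read_text(encoding="utf-8"))
    # Assumption: channels are top-level keys of the state object.
    removed = state.pop(channel, None) is not None
    if removed:
        path.write_text(
            json.dumps(state, ensure_ascii=False, indent=2) + "\n",
            encoding="utf-8",
        )
    return removed
```

With no cursor left, the next run should treat the channel as a first run and pull the `DEFAULT_LOOKBACK_DAYS` window again, so the edited filter is applied to history as well.
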
### Sanity-check existing filters

When tuning, always:

- Sample `kept` messages — are they all valid for Oleg?
- For channels with `kept == 0`, **verify with an unfiltered pull** (temporarily remove the channel's entry and rerun for it alone) that nothing legitimate is being thrown away. Don't assume 0 = correct without checking.

---

## Truncation — when the 500-message cap bites

A channel with `"truncated": true` in `telegram_inbox.json` had >500 raw messages in the lookback window. We see the most-recent 500 and silently miss the tail (older portion of the window).

For `*_jobs` Russian channels truncation typically means we covered 1–10 days of a 30-day window. Strict hashtag filters then leave 1–7 kept messages — but the **missed** older messages could contain relevant vacancies.

Options:

- Bump `MAX_PER_CHANNEL` globally (more API calls, longer run).
- Narrow lookback for the busy channel (no per-channel knob today — would require a code change).
- Tune the filter to be stricter so fewer raw messages need processing — only useful if the filter applies at the API level, which substring filters don't.

For now, keep the cap and accept the tail loss for very busy channels; relax only when a specific channel justifies it.

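To get a rough sense of how much of the window a truncated channel actually covered, the inbox file can be inspected after a run. The sketch below is an illustration under stated assumptions: `generated_at` and each message `date` are taken to be ISO 8601 strings, naive timestamps are treated as UTC, and the names `parse_ts` and `covered_days` are hypothetical. Because the inbox holds only kept messages, the figure is a lower bound on the days covered, not an exact depth.

```python
from datetime import datetime, timezone


def parse_ts(value):
    """Parse an ISO 8601 timestamp; naive values are assumed to be UTC."""
    ts = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)


def covered_days(inbox):
    """Map each truncated channel to a lower bound on the days it covered."""
    generated = parse_ts(inbox["generated_at"])
    result = {}
    for name, channel in inbox["channels"].items():
        if not channel.get("truncated"):
            continue
        messages = channel.get("messages", [])
        if not messages:
            result[name] = None  # nothing kept, so no date to measure from
            continue
        oldest = min(parse_ts(m["date"]) for m in messages)
        result[name] = round((generated - oldest).total_seconds() / 86400, 1)
    return result
```

Pass it the parsed contents of `tracking/telegram_inbox.json`. A channel whose value is far below 30 is a candidate for a higher cap.
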
---

## Output of a fetch run

`telegram_inbox.json` structure (overwritten each run):

```jsonc
{
  "generated_at": "2026-06-02T...",
  "lookback_days_for_new_channels": 30,
  "total_in_inbox": ,
  "channels": {
    "": {
      "lang": "ru" | "en" | null,            // null = channel is still "new" / pending
      "priority": "p1" | "p2" | "p3" | null,
      "seen": ,                              // raw messages fetched
      "kept": ,                              // after filter
      "filtered_out": ,
      "first_run": ,                         // no prior state cursor
      "truncated": ,                         // hit MAX_PER_CHANNEL
      "filter_mode": "filtered (...)" | "trust-all (no filter)" | "unfiltered (new channel — not yet curated)",
      "messages": [
        { "id": , "date": "", "text": "...", "has_media": , "link": "https://t.me/.../id" }
      ]
    }
  }
}
```

Messages are **chronological per channel** (oldest first within each channel).

### Useful jq probes

```bash
# Per-channel summary sorted by kept desc
jq -r '.channels | to_entries | sort_by(.value.kept) | reverse | .[] |
  "\(.key) → kept \(.value.kept)/\(.value.seen) [\(.value.priority // "—")/\(.value.lang // "—")]"' \
  tracking/telegram_inbox.json

# All p1 kept messages
jq '.channels | to_entries | map(select(.value.priority == "p1")) | from_entries' \
  tracking/telegram_inbox.json

# Truncated channels with kept/seen counts and priority
jq -r '.channels | to_entries | map(select(.value.truncated)) | .[] |
  "\(.key): kept \(.value.kept)/\(.value.seen), priority \(.value.priority)"' \
  tracking/telegram_inbox.json
```

---

## After triage

Promising postings → **create a Trello card** on the BestJob board (TODO column). The card is the application record — see the `triage-jobs` skill ("After triage — update tracking") for the card schema. Trello is the source of truth for applications; `applications.md` is frozen/legacy and is neither read nor written. Don't accumulate a "seen but skipped" log — the state cursor already prevents re-reading.

For outreach (cold DMs, recruiter conversations) → `outreach.md`, one row per touchpoint.

If Oleg unsubscribes from a channel in Telegram, it disappears from the live folder list, the next run won't fetch it, and its entry in `telegram_channels.json` becomes dead weight. Periodic cleanup is fine but not required — dead entries cost ~150 bytes.