tracking/ — Job-search tracking and Telegram vacancy pipeline

This folder is the operational layer of the job search: the curated channel registry, the live cursor for incremental Telegram pulls, the staging area for messages awaiting triage, and the long-form logs of applications and outreach.

If you (Claude) are about to do anything related to "find vacancies in Telegram", "scan job channels", "what's new in Jobs", "triage a new channel", or similar — this is the file to read first. The main CLAUDE.md references it from the Telegram workflow section.

Files at a glance

File	Purpose	In git?
`telegram_channels.json`	Curated source of truth — per-channel `lang`, `priority`, and filter (`include`/`exclude`). Tunable by hand.	✅ committed
`telegram_state.json`	Per-machine cursor — `last_message_id` and `last_seen_date` per channel. Regenerated automatically.	❌ gitignored
`telegram_inbox.json`	Output of the last fetch run — kept messages only, per channel, with `lang`/`priority` injected. Overwritten each run.	❌ gitignored
`telegram_pending_channels.json`	Generated only when the last run had new (untriaged) channels: a keyword-frequency scan to bootstrap their curation. Deleted on the next run if no channels are pending.	❌ gitignored
`applications.md`	One row per application — manually maintained, append-only.	✅ committed
`outreach.md`	Cold messages, recruiter pings, follow-ups. One row per touchpoint.	✅ committed

Running the pipeline

Two scripts, chainable. Always run from project root.

~/.local/bin/uv run scripts/list_telegram_channels.py \
  | ~/.local/bin/uv run scripts/fetch_telegram_jobs.py -

Step 1 — scripts/list_telegram_channels.py: reads the live "Jobs" folder from Telegram via Telethon and emits a JSON array of channel usernames (or numeric ids for private channels) to stdout. Always run fresh — Oleg curates the folder manually and adds new channels regularly.

Step 2 — scripts/fetch_telegram_jobs.py: pulls new messages per channel, applies the per-channel filter, and writes results to telegram_inbox.json. Accepts channels as positional args or as a JSON array on stdin (-).
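The two input modes of step 2 could be handled roughly as follows. This is a minimal sketch, not the script's actual code: `parse_channels` is a hypothetical helper, and turning numeric ids into strings is an assumption.

```python
import json
from typing import Sequence, TextIO


def parse_channels(args: Sequence[str], stdin: TextIO) -> list[str]:
    """Return channel ids from positional args, or from a JSON array on stdin when args is ["-"]."""
    if list(args) == ["-"]:
        data = json.load(stdin)
        if not isinstance(data, list):
            raise ValueError("expected a JSON array of channel ids on stdin")
        # Assumption: numeric ids of private channels are normalized to strings.
        return [str(item) for item in data]
    return list(args)
```

In a script this would be called as `parse_channels(sys.argv[1:], sys.stdin)`.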

Account: both scripts connect directly via Telethon using TELEGRAM_SESSION_STRING from .env — that must be the usulsu (main) session. The "Jobs" folder lives on that account. Do not put the samuishechka session there.

Constants in the fetch script

DEFAULT_LOOKBACK_DAYS = 30 — first-time lookback window for new channels (no cursor yet).
MAX_PER_CHANNEL = 500 — hard cap on raw messages fetched per channel per run. A channel with more than 500 messages in the lookback window is marked truncated: true in the output, and the older tail of the window is dropped with no other warning. Tune per scenario (see "Truncation" below).
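How the two constants interact can be sketched on an in-memory message list. This is an illustration under assumptions, not the fetch script's code: `select_window` is a hypothetical name, and the real script talks to the Telegram API instead of filtering a list.

```python
from datetime import datetime, timedelta, timezone

DEFAULT_LOOKBACK_DAYS = 30
MAX_PER_CHANNEL = 500


def select_window(messages, last_message_id, now):
    """messages: list of {"id": int, "date": datetime} in any order.

    Returns (selected messages oldest-first, truncated flag).
    """
    if last_message_id is None:
        # First run for this channel: fall back to the lookback window.
        cutoff = now - timedelta(days=DEFAULT_LOOKBACK_DAYS)
        fresh = [m for m in messages if m["date"] >= cutoff]
    else:
        # Incremental run: only messages newer than the cursor.
        fresh = [m for m in messages if m["id"] > last_message_id]
    fresh.sort(key=lambda m: m["id"])
    truncated = len(fresh) > MAX_PER_CHANNEL
    # Keep the most recent MAX_PER_CHANNEL messages; the older tail is lost.
    return fresh[-MAX_PER_CHANNEL:], truncated
```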

Trigger

Vacancy scans run only when Oleg explicitly asks, e.g. "забери свежее из Jobs" (Russian: "grab what's fresh from Jobs") or "что нового в каналах" ("what's new in the channels"). No background polling.

telegram_channels.json — schema

Each entry is keyed by username (or numeric id for private channels) and is an object:

{
  "<channel_id>": {
    "lang": "ru" | "en" | "...",        // required
    "priority": "p1" | "p2" | "p3",     // required
    "include": <filter_form>,           // optional — absent = trust-all (no positive constraint)
    "exclude": ["kw1", "kw2", ...]      // optional — absent = no negative constraint
  }
}
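Since the file is hand-edited, a quick shape check before a run can catch typos. The sketch below follows the schema above; `validate_entry` is a hypothetical helper, and nothing here claims the fetch script performs this validation.

```python
VALID_PRIORITIES = {"p1", "p2", "p3"}


def validate_entry(channel_id, entry):
    """Return a list of human-readable problems; an empty list means the entry looks fine."""
    problems = []
    lang = entry.get("lang")
    if not isinstance(lang, str) or not lang:
        problems.append(f"{channel_id}: 'lang' is required")
    if entry.get("priority") not in VALID_PRIORITIES:
        problems.append(f"{channel_id}: 'priority' must be one of p1/p2/p3")
    include = entry.get("include", [])
    include_ok = isinstance(include, list) and all(
        isinstance(g, str) or (isinstance(g, list) and all(isinstance(k, str) for k in g))
        for g in include
    )
    if not include_ok:
        problems.append(f"{channel_id}: 'include' must be a list of strings and/or lists of strings")
    exclude = entry.get("exclude", [])
    if not isinstance(exclude, list) or not all(isinstance(k, str) for k in exclude):
        problems.append(f"{channel_id}: 'exclude' must be a flat list of strings")
    return problems
```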

A message passes the filter when:

No exclude keyword (case-insensitive substring) is present, AND
Every include OR-group contributes at least one match.

If both include and exclude are absent → trust-all (every message passes; useful for low-volume personal/digest channels).

`include` — the four forms

Form	Semantics	Example
`[]` or absent	trust-all	(no constraint)
`["a", "b"]`	flat OR — at least one matches	`["javascript", "react"]`
`[["a", "b"], ["c", "d"]]`	AND of OR-groups — every group needs ≥1 hit	`[["#vacancy","#вакансия"], ["#remote","#удаленка"]]`
`[["a","b"], "c"]`	scalars auto-promoted to single-item groups	same as `[["a","b"], ["c"]]`

`exclude` — flat list

If any keyword in exclude appears in the text → the message is rejected, even if include would have matched. Used to drop wrong-stack postings from generic channels.
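Putting the pass rule, the four include forms, and exclude together, the matching logic might look like the sketch below. The function names are hypothetical and the real script may be organized differently; only the semantics are taken from this file.

```python
def normalize_include(include):
    """Turn any of the four include forms into a list of OR-groups (a list of lists)."""
    if not include:
        return []  # absent or []: trust-all, no positive constraint
    if all(isinstance(item, str) for item in include):
        return [list(include)]  # flat list: a single OR-group
    # Nested or mixed form: bare strings are promoted to single-item groups.
    return [[item] if isinstance(item, str) else list(item) for item in include]


def passes_filter(text, include=None, exclude=None):
    """Case-insensitive substring filter: exclude wins, then every OR-group needs a hit."""
    haystack = text.lower()
    if any(kw.lower() in haystack for kw in (exclude or [])):
        return False
    return all(
        any(kw.lower() in haystack for kw in group)
        for group in normalize_include(include)
    )
```

With no include and no exclude, `all()` over an empty list is true, which is exactly the trust-all case.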

Standard Oleg-stack excludes for jobs feeds:

["kafka", "golang", "kotlin", "android", "swift"]

For *_jobs channels with hashtag-based filters, add resume excludes too:

["kafka", "golang", "kotlin", "android", "swift", "#резюме", "#resume", "#cv", "#ищуработу"]

Pitfalls

Case-insensitive substring matching, no word boundaries. "go" matches "going" / "Goldbelt" / "google", which is why we use "golang" instead. Same trap for "java" (matches "javascript"); use " java " with spaces, or "#java " for the hashtag form. Pad other short keywords the same way: " rust ", " ios ". A padded keyword no longer matches at the very start or end of the text or next to punctuation ("Java,"), so padding trades false positives for some misses.
"react native" in exclude also rejects fullstack postings that merely mention React Native. Prefer excluding kotlin/android/swift/flutter to block mobile roles, and exclude "react native" only when the channel is mobile-only.
The same keyword can appear in include for one channel and exclude for another — they're per-channel, independent.
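The substring traps above are easy to reproduce. A small sketch, where `hits` is a hypothetical stand-in for the filter's keyword test:

```python
def hits(keyword, text):
    """Case-insensitive substring test, with no word boundaries."""
    return keyword.lower() in text.lower()


checks = [
    ("go", "We deploy on Google Cloud"),      # false positive: "go" inside "Google"
    ("golang", "We deploy on Google Cloud"),  # the longer keyword avoids it
    ("java", "Senior JavaScript developer"),  # false positive: "java" inside "JavaScript"
    (" java ", "Senior JavaScript developer"),
    (" java ", "Stack: Kotlin, Java"),        # padded keyword misses a hit at the end of the text
]
for keyword, text in checks:
    print(repr(keyword), "in", repr(text), "->", hits(keyword, text))
```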

Priority levels

Set on every channel. Assignment is judged by the best vacancy seen in a fresh fetch for that channel — not by volume or hashtag density.

Level	Meaning	Triage attention
p1	Very relevant — strong stack hits and global-remote culture. Posts that Oleg would actually apply to.	Read every kept message.
p2	Stack OK but culture is internal market (Russian RUB/CIS-only roles), or culture OK but salary band typically misses Oleg's threshold (US-only with low pay, Netherlands on-site). Worth periodic scanning — occasional gems. Market-intel channels (recruiter content) live here too.	Skim, dive into interesting headlines.
p3	Wrong stack (mobile-native, devops, QA), off-market (Nigeria with ₦ salaries, Netherlands junior on-site), pure chat/noise, founder lifestyle blogs, or dead channels (0 messages in lookback). Subscribed for completeness; Oleg may pivot or want an occasional glance.	Glance only on request.

When triaging the inbox, sort/group by priority first, then by lang.
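That ordering can be sketched over the "channels" object of telegram_inbox.json. `triage_order` is a hypothetical helper; putting pending channels (null priority) last is an assumption.

```python
def triage_order(channels):
    """Return channel ids sorted by priority (p1 first), then lang, then name."""
    def sort_key(item):
        name, info = item
        # Assumption: pending channels have null priority/lang and sort last.
        priority = info.get("priority") or "p9"
        lang = info.get("lang") or "~"
        return (priority, lang, name)

    return [name for name, _ in sorted(channels.items(), key=sort_key)]
```

Usage: `triage_order(json.load(open("tracking/telegram_inbox.json"))["channels"])`.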

Language codes

Free-form short ISO-style codes — pick what fits:

ru — Russian (most curated channels)
en — English
mixed — multi-language channel, when you can't pick a primary
nl, de, etc. — for regional boards

This isn't strict; it's a hint for triage attention (Oleg reads ru and en fluently; everything else needs translation overhead).

Triaging a new channel — full procedure

A "new" channel = one that's in the Telegram "Jobs" folder but doesn't have an entry in telegram_channels.json. Detected automatically: the fetch script puts its raw messages into telegram_inbox.json unfiltered and writes a keyword-frequency scan to telegram_pending_channels.json.
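The keyword-frequency scan can be approximated as below. This is a sketch, not the script's implementation: both function names are hypothetical, and counting messages that contain a keyword (rather than total occurrences) is an assumption.

```python
from collections import Counter


def collect_keywords(channels_config):
    """All include + exclude keywords across curated channels, lowercased and de-duplicated."""
    keywords = set()
    for entry in channels_config.values():
        for item in entry.get("include", []):
            group = [item] if isinstance(item, str) else item
            keywords.update(kw.lower() for kw in group)
        keywords.update(kw.lower() for kw in entry.get("exclude", []))
    return keywords


def keyword_counts(texts, keywords):
    """Count how many messages contain each keyword (case-insensitive substring)."""
    counts = Counter()
    for text in texts:
        haystack = text.lower()
        counts.update(kw for kw in keywords if kw in haystack)
    return dict(counts.most_common())
```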

Steps to graduate a channel out of pending:

Read telegram_pending_channels.json — for each new channel:
- keyword_counts_from_other_channels: how often every existing keyword (include + exclude across all channels) appears in this channel's recent messages. Quick signal of stack and posting style.
- messages_scanned, first_run, truncated: volume context.
Open telegram_inbox.json and sample 3–8 messages from this channel directly:
```
jq -r '.channels["<channel>"].messages[:5] | .[] | "── \(.date[0:16])\n\(.text[0:400])\n"' tracking/telegram_inbox.json
```
Look for: hashtag patterns, language, post structure (single role vs digest vs chat), recurring noise types.
Decide lang and priority using the rubrics above. Base priority on the best vacancy in the sample, not the average.
Decide filter shape:
- Channel posts proper #vacancy/#вакансия + #remote/#удаленка tags → use the standard hashtag AND-of-OR + Oleg-stack excludes (most *_jobs channels).
- Channel posts vacancy text without consistent hashtags → use positive stack include (["javascript", "typescript", "react", ...]) + the same Oleg-stack excludes.
- Channel is low-volume personal/curated content (recruiter musings, market intel) where the value is the whole post → trust-all (omit include and exclude).
- Channel is a digest that mixes resumes and vacancies (e.g. javascript_jobs_feed) → trust-all is usually the right call; excluding "резюме" ("resume") would drop the whole digest.
- Channel is mostly noise/wrong stack but worth keeping subscribed → strict positive filter, accept that most runs will return 0.
Add the entry to telegram_channels.json. JSON is hand-edited; keep entries ordered by priority then alphabetically for readability.
Rerun the chain. The channel transitions out of pending. The telegram_pending_channels.json file is automatically deleted when no pending channels remain.
Validate: sample the new kept messages and verify that nothing wrong passes and nothing relevant is dropped. If the filter is wrong, edit it and rerun. Keeping the state cursor is fine for that, but incremental fetches filter only messages newer than the cursor, so to validate the filter against history, clear that channel's cursor first: jq 'del(.["<channel>"])' tracking/telegram_state.json > tmp.json && mv tmp.json tracking/telegram_state.json (jq prints to stdout and does not edit the file in place).
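Clearing one channel's cursor can also be done in Python. A minimal sketch, assuming telegram_state.json is a JSON object keyed by channel at the top level (as the jq expression in the last step implies); `clear_cursor` is a hypothetical helper.

```python
import json
from pathlib import Path


def clear_cursor(state_path, channel):
    """Remove one channel's cursor and rewrite the state file. Returns True if an entry was removed."""
    path = Path(state_path)
    state = json.loads(path.read_text(encoding="utf-8"))
    removed = state.pop(channel, None) is not None
    if removed:
        path.write_text(
            json.dumps(state, ensure_ascii=False, indent=2) + "\n",
            encoding="utf-8",
        )
    return removed
```

The next run then treats the channel as a first run and applies the lookback window again.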

Sanity-check existing filters

When tuning, always:

Sample kept messages — are they all valid for Oleg?
For channels with kept == 0, verify with an unfiltered pull (temporarily remove the channel's entry and rerun for it alone) that nothing legitimate is being thrown away. Don't assume 0 = correct without checking.
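Candidates for that second check can be listed straight from the inbox. A sketch with a hypothetical helper name; it uses only fields documented for telegram_inbox.json.

```python
def zero_kept_channels(inbox):
    """Curated channels that fetched messages but kept none: worth an unfiltered re-check."""
    return sorted(
        name
        for name, info in inbox.get("channels", {}).items()
        if info.get("priority") is not None  # skip pending (uncurated) channels
        and info.get("seen", 0) > 0
        and info.get("kept", 0) == 0
    )
```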

Truncation — when the 500-message cap bites

A channel with "truncated": true in telegram_inbox.json had >500 raw messages in the lookback window. We see the most-recent 500 and silently miss the tail (older portion of the window).

For *_jobs Russian channels truncation typically means we covered 1–10 days of a 30-day window. Strict hashtag filters then leave 1–7 kept messages — but the missed older messages could contain relevant vacancies.

Options:

Bump MAX_PER_CHANNEL globally (more API calls, longer run).
Narrow lookback for the busy channel (no per-channel knob today — would require a code change).
Tune the filter to be stricter so fewer raw messages need processing. This helps only if the filter is applied at the API level; substring filters run after the fetch, so the cap still counts raw messages.

For now, keep the cap and accept the tail loss for very busy channels; relax only when a specific channel justifies it.

Output of a fetch run

telegram_inbox.json structure (overwritten each run):

{
  "generated_at": "2026-06-02T...",
  "lookback_days_for_new_channels": 30,
  "total_in_inbox": <int>,
  "channels": {
    "<channel>": {
      "lang": "ru" | "en" | null,         // null = channel is still "new" / pending
      "priority": "p1" | "p2" | "p3" | null,
      "seen": <int>,                      // raw messages fetched
      "kept": <int>,                      // after filter
      "filtered_out": <int>,
      "first_run": <bool>,                // no prior state cursor
      "truncated": <bool>,                // hit MAX_PER_CHANNEL
      "filter_mode": "filtered (...)" | "trust-all (no filter)" | "unfiltered (new channel — not yet curated)",
      "messages": [
        { "id": <int>, "date": "<ISO>", "text": "...", "has_media": <bool>, "link": "https://t.me/.../id" }
      ]
    }
  }
}

Messages are chronological per channel (oldest first within each channel).
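A per-channel consistency check over this structure might look like the sketch below. It rests on assumptions not stated outright here: that seen equals kept plus filtered_out, that kept equals the number of messages, and that ascending message ids mean oldest-first. `check_channel` is a hypothetical helper.

```python
def check_channel(name, info):
    """Return a list of inconsistencies for one channel object from telegram_inbox.json."""
    problems = []
    messages = info.get("messages", [])
    # Assumption: every fetched message is either kept or filtered out.
    if info.get("seen") != info.get("kept", 0) + info.get("filtered_out", 0):
        problems.append(f"{name}: seen != kept + filtered_out")
    # Assumption: the messages array holds exactly the kept messages.
    if info.get("kept") != len(messages):
        problems.append(f"{name}: kept != number of messages")
    ids = [m["id"] for m in messages]
    if ids != sorted(ids):
        problems.append(f"{name}: messages are not oldest-first")
    return problems
```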

Useful jq probes

# Per-channel summary sorted by kept desc
jq -r '.channels | to_entries | sort_by(.value.kept) | reverse | .[]
  | "\(.key) → kept \(.value.kept)/\(.value.seen) [\(.value.priority // "—")/\(.value.lang // "—")]"' \
  tracking/telegram_inbox.json

# All p1 kept messages
jq '.channels | to_entries | map(select(.value.priority == "p1")) | from_entries' \
  tracking/telegram_inbox.json

# Truncated channels: kept/seen and priority
jq -r '.channels | to_entries | map(select(.value.truncated))
  | .[] | "\(.key): kept \(.value.kept)/\(.value.seen), priority \(.value.priority)"' \
  tracking/telegram_inbox.json

After triage

Promising postings → append a row to applications.md. Don't accumulate a "seen but skipped" log — the state cursor already prevents re-reading.

For outreach (cold DMs, recruiter conversations) → outreach.md, one row per touchpoint.

If Oleg unsubscribes from a channel in Telegram, it disappears from the live folder list, the next run won't fetch it, and its entry in telegram_channels.json becomes dead weight. Periodic cleanup is fine but not required — dead entries cost ~150 bytes.
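If cleanup is ever wanted, dead entries can be found by comparing the curated file with the output of step 1. A sketch with a hypothetical helper; comparing ids case-insensitively is an assumption.

```python
def dead_entries(channels_config, live_channels):
    """Curated channel ids that no longer appear in the live "Jobs" folder listing."""
    live = {str(c).lower() for c in live_channels}
    return sorted(name for name in channels_config if name.lower() not in live)
```

Here `channels_config` is the parsed telegram_channels.json and `live_channels` is the JSON array printed by scripts/list_telegram_channels.py.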
