# tracking/ — Job-search tracking and Telegram vacancy pipeline

This folder is the operational layer of the job search: the curated channel registry, the live cursor for incremental Telegram pulls, the staging area for messages awaiting triage, and the long-form logs of applications and outreach.

If you (Claude) are about to do anything related to "find vacancies in Telegram", "scan job channels", "what's new in Jobs", "triage a new channel", or similar — this is the file to read first. The main `CLAUDE.md` references it from the Telegram workflow section.

---
## Files at a glance

| File | Purpose | In git? |
|---|---|---|
| `telegram_channels.json` | **Curated source of truth** — per-channel `lang`, `priority`, and filter (`include`/`exclude`). Tunable by hand. | ✅ committed |
| `telegram_state.json` | Per-machine cursor — `last_message_id` and `last_seen_date` per channel. Regenerated automatically. | ❌ gitignored |
| `telegram_inbox.json` | Output of the last fetch run — kept messages only, per channel, with `lang`/`priority` injected. Overwritten each run. | ❌ gitignored |
| `telegram_pending_channels.json` | Generated only when the last run had **new** (untriaged) channels — keyword-frequency scan to bootstrap their curation. Deleted on the next run if no channels are pending. | ❌ gitignored |
| `applications.md` | **Frozen/legacy** — historical application log. Superseded by Trello (BestJob board) as the source of truth. Kept for backward compatibility; no longer read or written. | ✅ committed (frozen) |
| `outreach.md` | Cold messages, recruiter pings, follow-ups. One row per touchpoint. | ✅ committed |

---
## Running the pipeline

Two scripts, chainable. Always run from project root.

```bash
~/.local/bin/uv run scripts/list_telegram_channels.py \
  | ~/.local/bin/uv run scripts/fetch_telegram_jobs.py -
```

**Step 1 — `scripts/list_telegram_channels.py`**: reads the live "Jobs" folder from Telegram via Telethon and emits a JSON array of channel usernames (or numeric ids for private channels) to stdout. Always run fresh — Oleg curates the folder manually and adds new channels regularly.

**Step 2 — `scripts/fetch_telegram_jobs.py`**: pulls new messages per channel, applies the per-channel filter, and writes results to `telegram_inbox.json`. Accepts channels as positional args or as a JSON array on stdin (`-`).

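
The two input modes of Step 2 can be pictured with a small sketch. This is not the code in `scripts/fetch_telegram_jobs.py`; the function name `resolve_channels` and the choice to normalise numeric ids to strings are assumptions made for illustration.

```python
import json
import sys
from typing import Iterable, TextIO


def resolve_channels(argv: Iterable[str], stdin: TextIO = sys.stdin) -> list[str]:
    """Return channel ids from positional args, or from a JSON array on stdin.

    Hypothetical helper: the real script may parse its arguments differently.
    """
    args = list(argv)
    if args == ["-"]:
        payload = json.load(stdin)
        if not isinstance(payload, list):
            raise ValueError("expected a JSON array of channel ids on stdin")
        # Assumption: numeric ids (private channels) become strings so they
        # can be used as keys in the tracking JSON files.
        return [str(item) for item in payload]
    return args


if __name__ == "__main__":
    # With no arguments this prints an empty list and never touches stdin.
    print(resolve_channels(sys.argv[1:]))
```
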
**Account:** both scripts connect directly via Telethon using `TELEGRAM_SESSION_STRING` from `.env` — that must be the **usulsu** (main) session. The "Jobs" folder lives on that account. Do not put the samuishechka session there.

### Constants in the fetch script

- `DEFAULT_LOOKBACK_DAYS = 30` — first-time lookback window for new channels (no cursor yet).
- `MAX_PER_CHANNEL = 500` — hard cap on raw messages fetched per channel per run. A channel that posts >500 messages in the lookback window gets `truncated: true` in the output and we silently miss the tail. Tune per scenario (see "Truncation" below).

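
How the two constants interact with the state cursor can be sketched as follows. This is a guess at the decision logic, not the script itself: `plan_fetch`, `is_truncated`, the returned dict shape, and the `>=` comparison against the cap are all assumptions.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

DEFAULT_LOOKBACK_DAYS = 30
MAX_PER_CHANNEL = 500


def plan_fetch(cursor: Optional[dict], now: datetime) -> dict:
    """Describe what to request for one channel (hypothetical helper).

    `cursor` is the channel's entry from telegram_state.json, or None when
    the channel has never been fetched on this machine.
    """
    if cursor is None:
        # First run: no cursor yet, so bound the pull by date.
        return {
            "first_run": True,
            "min_id": 0,
            "not_before": now - timedelta(days=DEFAULT_LOOKBACK_DAYS),
            "limit": MAX_PER_CHANNEL,
        }
    # Incremental run: only messages newer than the stored id matter.
    return {
        "first_run": False,
        "min_id": cursor["last_message_id"],
        "not_before": None,
        "limit": MAX_PER_CHANNEL,
    }


def is_truncated(raw_count: int) -> bool:
    """Assumption: a run that fills the cap may have left older messages behind."""
    return raw_count >= MAX_PER_CHANNEL
```

A plan like this would then be handed to whatever Telethon call the script uses; that part is deliberately left out because it needs a live session.
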
### Trigger

Vacancy scans run **only when Oleg explicitly asks** (e.g. "забери свежее из Jobs", i.e. "grab what's new from Jobs", or "что нового в каналах", i.e. "what's new in the channels"). No background polling.

---
## telegram_channels.json — schema

Each entry is keyed by `username` (or numeric id for private channels) and is an object:

```jsonc
{
  "<channel_id>": {
    "lang": "ru" | "en" | "...",        // required
    "priority": "p1" | "p2" | "p3",     // required
    "include": <filter_form>,           // optional — absent = trust-all (no positive constraint)
    "exclude": ["kw1", "kw2", ...]      // optional — absent = no negative constraint
  }
}
```

A message **passes the filter** when:

1. **No** `exclude` keyword (case-insensitive substring) is present, AND
2. Every `include` OR-group contributes at least one match.

If both `include` and `exclude` are absent → **trust-all** (every message passes; useful for low-volume personal/digest channels).

### `include` — the four forms

| Form | Semantics | Example |
|---|---|---|
| `[]` or absent | trust-all | _(no constraint)_ |
| `["a", "b"]` | flat OR — at least one matches | `["javascript", "react"]` |
| `[["a", "b"], ["c", "d"]]` | AND of OR-groups — every group needs ≥1 hit | `[["#vacancy","#вакансия"], ["#remote","#удаленка"]]` |
| `[["a","b"], "c"]` | scalars auto-promoted to single-item groups | same as `[["a","b"], ["c"]]` |

### `exclude` — flat list

If **any** keyword in `exclude` appears in the text → the message is **rejected**, even if `include` would have matched. Used to drop wrong-stack postings from generic channels.

Standard Oleg-stack excludes for jobs feeds:

```json
["kafka", "golang", "kotlin", "android", "swift"]
```

For `*_jobs` channels with hashtag-based filters, add resume excludes too:

```json
["kafka", "golang", "kotlin", "android", "swift", "#резюме", "#resume", "#cv", "#ищуработу"]
```
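
The matching rules above (exclude first, then AND of OR-groups, with the four `include` forms normalised into groups) can be written down as a minimal sketch. The function names `normalize_include` and `passes_filter` are illustrative and are not taken from the fetch script.

```python
from typing import Optional, Sequence, Union

IncludeForm = Sequence[Union[str, Sequence[str]]]


def normalize_include(include: Optional[IncludeForm]) -> list[list[str]]:
    """Turn any of the four `include` forms into a list of OR-groups."""
    if not include:
        return []  # absent or [] means trust-all
    if all(isinstance(item, str) for item in include):
        return [[kw.lower() for kw in include]]  # flat list is one OR-group
    groups = []
    for item in include:
        # Scalars mixed in with groups are promoted to single-item groups.
        members = [item] if isinstance(item, str) else list(item)
        groups.append([kw.lower() for kw in members])
    return groups


def passes_filter(text: str, include=None, exclude=None) -> bool:
    """Case-insensitive substring filter: no exclude hit, every group hit."""
    haystack = text.lower()
    if any(kw.lower() in haystack for kw in (exclude or [])):
        return False
    return all(
        any(kw in haystack for kw in group)
        for group in normalize_include(include)
    )
```

With no `include` groups, `all()` over an empty sequence is true, which is how trust-all falls out of the same code path.
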
### Pitfalls

- **Case-insensitive substring matching**, no word boundaries. `"go"` matches "going" / "Goldbelt" / "google" — that's why we use `"golang"` instead. Same trap for `"java"` (matches "javascript"); use `" java "` with spaces, or `"#java "` for hashtag form. Pad other short keywords the same way: `" rust "`, `" ios "`. A padded keyword no longer matches at the very start or end of the text or next to punctuation, so expect a few misses.
- **`react native`** in `exclude` would also reject fullstack postings that merely mention React Native. Prefer excluding `kotlin`/`android`/`swift`/`flutter` to block mobile, and only block React Native when the channel is mobile-only.
- The same keyword can appear in `include` for one channel and `exclude` for another — they're per-channel, independent.

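
The substring trap is easy to reproduce. The snippet below is only a demonstration of plain case-insensitive substring tests; the helper name `contains` is made up for the example. The last two calls show what space-padding costs: the keyword is missed at the start of the text and next to punctuation.

```python
def contains(keyword: str, text: str) -> bool:
    """Case-insensitive substring test, with no word boundaries."""
    return keyword.lower() in text.lower()


print(contains("go", "We are going to Google"))      # True: false positive
print(contains("golang", "We are going to Google"))  # False
print(contains("java", "JavaScript developer"))      # True: false positive
print(contains(" java ", "Senior Java developer"))   # True
print(contains(" java ", "Java developer wanted"))   # False: no leading space
print(contains(" java ", "Kotlin, Java, Scala"))     # False: comma follows
```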
---

## Priority levels

Set on every channel. Assignment is judged **by the best vacancy seen in a fresh fetch** for that channel — not by volume or hashtag density.

| Level | Meaning | Triage attention |
|---|---|---|
| **p1** | Very relevant — strong stack hits **and** global-remote culture. Posts that Oleg would actually apply to. | Read every kept message. |
| **p2** | Stack OK but culture is internal market (Russian RUB/CIS-only roles), or culture OK but salary band typically misses Oleg's threshold (US-only with low pay, Netherlands on-site). Worth periodic scanning — occasional gems. Market-intel channels (recruiter content) live here too. | Skim, dive into interesting headlines. |
| **p3** | Wrong stack (mobile-native, devops, QA), off-market (Nigeria with ₦ salaries, Netherlands junior on-site), pure chat/noise, founder lifestyle blogs, or dead channels (0 messages in lookback). Subscribed for completeness — Oleg may pivot or want occasional glance. | Glance only on request. |

When triaging the inbox, sort/group by `priority` first, then by `lang`.

---
## Language codes

Free-form short ISO-style codes — pick what fits:

- `ru` — Russian (most curated channels)
- `en` — English
- `mixed` — multi-language channel, when you can't pick a primary
- `nl`, `de`, etc. — for regional boards

This isn't strict; it's a hint for triage attention (Oleg reads ru and en fluently; everything else needs translation overhead).

---
## Triaging a new channel — full procedure

A "new" channel = one that's in the Telegram "Jobs" folder but doesn't have an entry in `telegram_channels.json`. Detected automatically: the fetch script puts its raw messages into `telegram_inbox.json` unfiltered and writes a keyword-frequency scan to `telegram_pending_channels.json`.

Steps to graduate a channel out of pending:

1. **Read `telegram_pending_channels.json`** — for each new channel:
   - `keyword_counts_from_other_channels`: how often every existing keyword (include + exclude across all channels) appears in this channel's recent messages. Quick signal of stack and posting style.
   - `messages_scanned`, `first_run`, `truncated`: volume context.
2. **Open `telegram_inbox.json`** and sample 3–8 messages from this channel directly:
   ```bash
   jq -r '.channels["<channel>"].messages[:5] | .[] | "── \(.date[0:16])\n\(.text[0:400])\n"' tracking/telegram_inbox.json
   ```
   Look for: hashtag patterns, language, post structure (single role vs digest vs chat), recurring noise types.
3. **Decide `lang` and `priority`** using the rubrics above. Base priority on the **best** vacancy in the sample, not the average.
4. **Decide filter shape:**
   - Channel posts proper `#vacancy`/`#вакансия` + `#remote`/`#удаленка` tags → use the standard hashtag AND-of-OR + Oleg-stack excludes (most `*_jobs` channels).
   - Channel posts vacancy text without consistent hashtags → use **positive stack include** (`["javascript", "typescript", "react", ...]`) + the same Oleg-stack excludes.
   - Channel is low-volume personal/curated content (recruiter musings, market intel) where the value is the whole post → **trust-all** (omit `include` and `exclude`).
   - Channel is a digest that mixes resumes and vacancies (e.g. `javascript_jobs_feed`) → trust-all is usually the right call; filtering `резюме` would drop the whole digest.
   - Channel is mostly noise/wrong stack but worth keeping subscribed → strict positive filter, accept that most runs will return 0.
5. **Add the entry to `telegram_channels.json`**. JSON is hand-edited; keep entries ordered by `priority` then alphabetically for readability.
6. **Rerun the chain.** The channel transitions out of pending. The `telegram_pending_channels.json` file is automatically deleted when no pending channels remain.
7. **Validate** — sample the new `kept` messages and verify nothing wrong is passing or being dropped. If the filter is wrong, edit and rerun. The state cursor can usually stay, because incremental fetches filter only new messages. To validate the filter against history, clear the state for that channel first: `jq 'del(.["<channel>"])' tracking/telegram_state.json` (jq prints the result to stdout, so write it back to the file yourself).

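
The keyword-frequency scan from step 1 can be approximated with the sketch below. It is an assumption about how `keyword_counts_from_other_channels` might be produced, not the script's code: the helper names are invented, and counting each message at most once per keyword is a choice made here, since the source does not say whether messages or occurrences are counted.

```python
from collections import Counter


def collect_keywords(channels: dict) -> set[str]:
    """Every include/exclude keyword used by any curated channel, lower-cased."""
    keywords = set()
    for entry in channels.values():
        for item in list(entry.get("include", [])) + list(entry.get("exclude", [])):
            # include may hold nested OR-groups; flatten one level.
            for kw in ([item] if isinstance(item, str) else item):
                keywords.add(kw.lower())
    return keywords


def keyword_counts(messages: list[str], keywords: set[str]) -> dict[str, int]:
    """Number of messages in which each keyword appears at least once."""
    counts = Counter()
    for text in messages:
        lowered = text.lower()
        counts.update(kw for kw in keywords if kw in lowered)
    return dict(counts.most_common())
```

High counts for hashtag keywords suggest the standard hashtag filter will work; high counts only for stack keywords point towards a positive stack include.
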
### Sanity-check existing filters

When tuning, always:

- Sample `kept` messages — are they all valid for Oleg?
- For channels with `kept == 0`, **verify with an unfiltered pull** (temporarily remove the channel's entry and rerun for it alone) that nothing legitimate is being thrown away. Don't assume 0 = correct without checking.

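
To find the channels that need the second check, something like the following could be run against a loaded `telegram_inbox.json`. The helper name is hypothetical; it relies only on the `seen`, `kept`, and `priority` fields of the inbox, and treats a null `priority` as a pending channel, which is unfiltered anyway.

```python
def zero_kept_channels(inbox: dict) -> list[str]:
    """Curated channels that fetched messages but kept none of them."""
    return sorted(
        name
        for name, info in inbox["channels"].items()
        if info.get("priority") is not None  # skip pending (unfiltered) channels
        and info.get("seen", 0) > 0
        and info.get("kept", 0) == 0
    )
```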
---

## Truncation — when the 500-message cap bites

A channel with `"truncated": true` in `telegram_inbox.json` had >500 raw messages in the lookback window. We see the most-recent 500 and silently miss the tail (older portion of the window).

For `*_jobs` Russian channels truncation typically means we covered 1–10 days of a 30-day window. Strict hashtag filters then leave 1–7 kept messages — but the **missed** older messages could contain relevant vacancies.

Options:

- Bump `MAX_PER_CHANNEL` globally (more API calls, longer run).
- Narrow lookback for the busy channel (no per-channel knob today — would require a code change).
- Tune the filter to be stricter so fewer raw messages need processing — only useful if the filter applies at the API level, which substring filters don't.

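
A rough way to see how much of the window a truncated channel actually covered is to measure from the oldest kept message to `generated_at`. Because the inbox holds kept messages only, this is a lower bound on coverage, not the true depth. The helper names and the handling of a trailing `Z` or missing timezone are assumptions, since the exact ISO format is not documented here.

```python
from datetime import datetime, timezone
from typing import Optional


def parse_iso(value: str) -> datetime:
    """Parse an ISO timestamp; naive values are assumed to be UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    return parsed if parsed.tzinfo else parsed.replace(tzinfo=timezone.utc)


def covered_days(channel: dict, generated_at: str) -> Optional[float]:
    """Days between the oldest kept message and the run (lower bound)."""
    messages = channel.get("messages") or []
    if not messages:
        return None  # nothing kept, so coverage cannot be estimated
    oldest = min(parse_iso(m["date"]) for m in messages)
    return (parse_iso(generated_at) - oldest).total_seconds() / 86400
```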
For now, keep the cap and accept the tail loss for very busy channels; relax only when a specific channel justifies it.

---
## Output of a fetch run

`telegram_inbox.json` structure (overwritten each run):

```jsonc
{
  "generated_at": "2026-06-02T...",
  "lookback_days_for_new_channels": 30,
  "total_in_inbox": <int>,
  "channels": {
    "<channel>": {
      "lang": "ru" | "en" | null,           // null = channel is still "new" / pending
      "priority": "p1" | "p2" | "p3" | null,
      "seen": <int>,                         // raw messages fetched
      "kept": <int>,                         // after filter
      "filtered_out": <int>,
      "first_run": <bool>,                   // no prior state cursor
      "truncated": <bool>,                   // hit MAX_PER_CHANNEL
      "filter_mode": "filtered (...)" | "trust-all (no filter)" | "unfiltered (new channel — not yet curated)",
      "messages": [
        { "id": <int>, "date": "<ISO>", "text": "...", "has_media": <bool>, "link": "https://t.me/.../id" }
      ]
    }
  }
}
```

Messages are **chronological per channel** (oldest first within each channel).

### Useful jq probes

```bash
# Per-channel summary sorted by kept desc
jq -r '.channels | to_entries | sort_by(.value.kept) | reverse | .[]
  | "\(.key) → kept \(.value.kept)/\(.value.seen) [\(.value.priority // "—")/\(.value.lang // "—")]"' \
  tracking/telegram_inbox.json

# All p1 channels with their kept messages
jq '.channels | to_entries | map(select(.value.priority == "p1")) | from_entries' \
  tracking/telegram_inbox.json

# Truncated channels with kept/seen counts and priority
jq -r '.channels | to_entries | map(select(.value.truncated))
  | .[] | "\(.key): kept \(.value.kept)/\(.value.seen), priority \(.value.priority)"' \
  tracking/telegram_inbox.json
```
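
The triage order described under "Priority levels" (priority first, then `lang`) has no jq probe above, so here is a Python sketch of it. The function name, the tie-break on channel name, and the decision to sort pending channels (null `priority`/`lang`) last are assumptions for illustration.

```python
import json
from pathlib import Path

PRIORITY_ORDER = {"p1": 0, "p2": 1, "p3": 2}


def triage_order(inbox: dict) -> list[str]:
    """Channel ids sorted by priority, then lang; pending channels go last."""
    def key(item):
        name, info = item
        rank = PRIORITY_ORDER.get(info.get("priority"), len(PRIORITY_ORDER))
        # "~" sorts after lowercase letters, so a null lang ends up last.
        return (rank, info.get("lang") or "~", name)

    return [name for name, _ in sorted(inbox["channels"].items(), key=key)]


if __name__ == "__main__":
    path = Path("tracking/telegram_inbox.json")
    if path.exists():  # run from project root, after a fetch
        inbox = json.loads(path.read_text(encoding="utf-8"))
        for name in triage_order(inbox):
            print(name)
```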
---

## After triage

Promising postings → **create a Trello card** on the BestJob board (TODO column). The card is the application record — see the `triage-jobs` skill ("After triage — update tracking") for the card schema. Trello is the source of truth for applications; `applications.md` is frozen/legacy and is neither read nor written. Don't accumulate a "seen but skipped" log — the state cursor already prevents re-reading.

For outreach (cold DMs, recruiter conversations) → `outreach.md`, one row per touchpoint.

If Oleg unsubscribes from a channel in Telegram, it disappears from the live folder list, the next run won't fetch it, and its entry in `telegram_channels.json` becomes dead weight. Periodic cleanup is fine but not required — dead entries cost ~150 bytes.