Configuration

This project supports typed programmatic config and TOML-driven runtime config.

Programmatic Library Config

JobScraperConfig

JobScraperConfig captures the runtime inputs for a single scrape (a construction sketch follows the list):

  • position: job title or search phrase
  • location: LinkedIn location text
  • openai_enabled: enable optional description enrichment
  • openai_model: model name for optional OpenAI enrichment, default gpt-4o-mini
  • time_posted: TimePosted enum value or matching string
  • remote: RemoteType enum value or matching string
  • distance: search radius
  • advanced_config: optional JobScraperAdvancedConfig overrides
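
A minimal construction sketch, assuming the names above import from the package root (the exact import path may differ):

from LinkedInWebScraper import JobScraperConfig, RemoteType, TimePosted

config = JobScraperConfig(
    position="Data Scientist",
    location="Monterrey",
    openai_enabled=False,          # keep enrichment off unless OPENAI_API_KEY is set
    openai_model="gpt-4o-mini",
    time_posted=TimePosted.DAY,    # enum value or the matching string "DAY"
    remote=RemoteType.REMOTE,      # enum value or the matching string "REMOTE"
    distance=25,                   # search radius
)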

JobScraperAdvancedConfig

Use it for optional scraper overrides such as (a usage sketch follows the list):

  • LOCATION_MAPPING
  • KEYWORDS
  • SKILLS_CATEGORIES
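
A hedged usage sketch; the field names match the list above, but the import path and the value shapes are assumptions:

from LinkedInWebScraper import JobScraperAdvancedConfig, JobScraperConfig

advanced = JobScraperAdvancedConfig(
    LOCATION_MAPPING={"Mexico City": "Ciudad de México, Mexico"},  # assumed dict shape
    KEYWORDS=["python", "sql", "airflow"],                         # assumed list shape
    SKILLS_CATEGORIES={"data": ["pandas", "spark"]},               # assumed mapping shape
)

config = JobScraperConfig(
    position="Data Scientist",
    location="Monterrey",
    advanced_config=advanced,
)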

Runtime Enums

  • TimePosted: ALL, MONTH, WEEK, DAY
  • RemoteType: ALL, ON-SITE, REMOTE, HYBRID
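
Both enums also accept the matching string; a short sketch, assuming the import path and an ON_SITE member identifier (serialized as "ON-SITE"):

from LinkedInWebScraper import RemoteType, TimePosted

time_posted = TimePosted.WEEK   # or the matching string "WEEK"
remote = RemoteType.HYBRID      # or the matching string "HYBRID"
on_site = RemoteType.ON_SITE    # assumed member name; the string form is "ON-SITE"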

Runtime TOML Config

The tracked template is runtime.example.toml. A real runtime.toml can define:

[logging]
level = "INFO"
file_name = "main.log"

[storage]
file_name = "linkedin_jobs.sqlite"
state_dir = "artifacts/state"

[scrape.once]
position = "Data Scientist"
location = "Monterrey"
openai_enabled = false
openai_model = "gpt-4o-mini"
time_posted = "DAY"
remote_types = ["REMOTE", "HYBRID", "ON-SITE"]
file_name = "LinkedIn_Jobs_Data_Scientist_Monterrey.csv"
output_dir = "artifacts/jobs"
append = true

[scrape.daily]
cities = ["Monterrey", "Guadalajara", "Mexico City"]
position = "Data Scientist"
openai_enabled = false
openai_model = "gpt-4o-mini"
time_posted = "DAY"
output_dir = "artifacts/jobs"
combined_file_name = "LinkedIn_Jobs_Data_Scientist_Mexico.csv"

[export]
run_id = ""
file_name = "linkedin_jobs_export.csv"
output_dir = "artifacts/jobs"
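
The file parses with the standard library; a minimal loading sketch, assuming Python 3.11+ for tomllib (use the tomli backport on older interpreters):

import tomllib  # Python 3.11+; pip install tomli and import tomli on older versions

with open("runtime.toml", "rb") as fh:  # tomllib requires binary mode
    runtime = tomllib.load(fh)

once = runtime["scrape"]["once"]
print(once["position"], once["remote_types"])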

CLI Override Precedence

Runtime settings resolve with the following precedence, highest first (a resolution sketch follows the list):

  1. CLI flags
  2. environment overrides
  3. TOML file values
  4. code defaults
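
A sketch of how a single setting could resolve under this chain; the helper is illustrative, not the project's actual resolver:

import os

def resolve_log_level(cli_value=None, toml_value=None, default="INFO"):
    # 1. CLI flag wins when provided.
    if cli_value is not None:
        return cli_value
    # 2. Then the environment override.
    env_value = os.environ.get("LINKEDIN_WEB_SCRAPER_LOG_LEVEL")
    if env_value:
        return env_value
    # 3. Then the TOML file value.
    if toml_value is not None:
        return toml_value
    # 4. Finally the code default.
    return default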

Supported env overrides include (a parsing sketch follows the list):

  • LINKEDIN_WEB_SCRAPER_CONFIG
  • LINKEDIN_WEB_SCRAPER_LOG_LEVEL
  • LINKEDIN_WEB_SCRAPER_LOG_FILE
  • LINKEDIN_WEB_SCRAPER_STORAGE_URL
  • LINKEDIN_WEB_SCRAPER_STORAGE_FILE
  • LINKEDIN_WEB_SCRAPER_STATE_DIR
  • LINKEDIN_WEB_SCRAPER_OUTPUT_DIR
  • LINKEDIN_WEB_SCRAPER_OPENAI_ENABLED
  • LINKEDIN_WEB_SCRAPER_OPENAI_MODEL
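
Boolean overrides such as LINKEDIN_WEB_SCRAPER_OPENAI_ENABLED arrive as strings; a small parsing sketch (the accepted spellings are assumptions):

import os

def env_flag(name: str, default: bool = False) -> bool:
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() in {"1", "true", "yes", "on"}  # assumed truthy spellings

openai_enabled = env_flag("LINKEDIN_WEB_SCRAPER_OPENAI_ENABLED")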

OpenAI Runtime Behavior

OpenAI support remains optional.

  • Install LinkedInWebScraper[openai]
  • Set OPENAI_API_KEY in the environment
  • Keep secrets out of TOML and out of the repo
  • Enrichment setup or request failures fall back to the non-enriched dataset
  • Enriched rows include audit fields such as OpenAIModel, OpenAIResponseId, and OpenAIRawPayload
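
The fall-back contract can be mirrored in caller code; a hedged sketch with a hypothetical enrich_descriptions helper (not a documented API):

def enrich_or_passthrough(df):
    try:
        from openai import OpenAI  # only importable with LinkedInWebScraper[openai]
        client = OpenAI()          # reads OPENAI_API_KEY from the environment
        return enrich_descriptions(df, client)  # hypothetical enrichment helper
    except Exception:
        # Setup or request failures return the non-enriched dataset unchanged.
        return df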

Artifact And State Paths

Managed defaults resolve under artifacts/:

  • bare CSV file names -> artifacts/jobs/
  • bare log file names -> artifacts/logs/
  • bare SQLite file names -> artifacts/state/

Explicit absolute paths and explicit nested relative paths are preserved.
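
A sketch of that resolution rule; resolve_artifact_path is illustrative, not the project's function name:

from pathlib import Path

def resolve_artifact_path(file_name: str, managed_dir: str) -> Path:
    p = Path(file_name)
    # Absolute and nested relative paths are preserved as given.
    if p.is_absolute() or p.parent != Path("."):
        return p
    # Bare file names land under the managed artifacts/ subdirectory.
    return Path("artifacts") / managed_dir / p

resolve_artifact_path("jobs.csv", "jobs")      # -> artifacts/jobs/jobs.csv
resolve_artifact_path("out/jobs.csv", "jobs")  # -> out/jobs.csv, preserved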

Storage Model

SQLite persistence is enabled by default for CLI and DailyScrapeService workflows.

  • scrape_runs stores run lifecycle metadata
  • jobs stores the latest canonical job attributes
  • job_snapshots stores the run-level dataframe payload
  • job_enrichments stores structured OpenAI audit data when present
  • CSV exports are downstream artifacts written from persisted data

Use build_sqlite_storage_url() for a managed default URL, or inject SQLiteScrapeStorage(storage_url=...) into DailyScrapeService when you need a custom local path or DSN.
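
A construction sketch using the names cited above; the import path and keyword names other than storage_url are assumptions:

from LinkedInWebScraper import (
    DailyScrapeService,
    SQLiteScrapeStorage,
    build_sqlite_storage_url,
)

# Managed default: a SQLite URL resolving under artifacts/state/.
default_storage = SQLiteScrapeStorage(storage_url=build_sqlite_storage_url())

# Custom local path or DSN when the managed default does not fit.
custom_storage = SQLiteScrapeStorage(storage_url="sqlite:///tmp/linkedin_jobs.sqlite")

service = DailyScrapeService(storage=custom_storage)  # kwarg name assumed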

Root Runtime Scripts

main.py and process_ds_jobs.py remain available as direct runtime entrypoints for the daily and once workflows.