## Configuration
This project supports typed programmatic config and TOML-driven runtime config.
### Programmatic Library Config

#### JobScraperConfig
`JobScraperConfig` captures the runtime inputs for a single scrape (a construction sketch follows the list):

- `position`: job title or search phrase
- `location`: LinkedIn location text
- `openai_enabled`: enable optional description enrichment
- `openai_model`: model name for optional OpenAI enrichment, default `gpt-4o-mini`
- `time_posted`: `TimePosted` enum value or matching string
- `remote`: `RemoteType` enum value or matching string
- `distance`: search radius
- `advanced_config`: optional `JobScraperAdvancedConfig` overrides
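A minimal sketch, assuming the classes are importable from the top-level `LinkedInWebScraper` package (the import path is not confirmed by this document):

```python
# Sketch only: the import path is an assumption based on the package name.
from LinkedInWebScraper import JobScraperConfig, RemoteType, TimePosted

config = JobScraperConfig(
    position="Data Scientist",    # job title or search phrase
    location="Monterrey",         # LinkedIn location text
    openai_enabled=False,         # enrichment stays off unless opted in
    time_posted=TimePosted.WEEK,  # enum value; a matching string also works
    remote=RemoteType.REMOTE,
    distance=25,                  # search radius; the unit is an assumption
)
```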
#### JobScraperAdvancedConfig

Use it for optional overrides such as the following (wiring is sketched after the list):

- `LOCATION_MAPPING`
- `KEYWORDS`
- `SKILLS_CATEGORIES`
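A hedged sketch, assuming the override names double as constructor keywords (the exact `JobScraperAdvancedConfig` signature may differ):

```python
# Hypothetical wiring; keyword names mirror the override list above, but the
# exact constructor signature is an assumption.
from LinkedInWebScraper import JobScraperAdvancedConfig, JobScraperConfig

advanced = JobScraperAdvancedConfig(
    LOCATION_MAPPING={"CDMX": "Mexico City"},  # normalize location spellings
    KEYWORDS=["python", "sql", "spark"],       # extra terms to look for
)
config = JobScraperConfig(
    position="Data Engineer",
    location="CDMX",
    advanced_config=advanced,  # optional overrides hook
)
```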
#### Runtime Enums

- `TimePosted`: `ALL`, `MONTH`, `WEEK`, `DAY`
- `RemoteType`: `ALL`, `ON-SITE`, `REMOTE`, `HYBRID`
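Per the field notes above, an enum member and its matching string are intended to be interchangeable; a small equivalence sketch under the same import-path assumption:

```python
# Both forms below should produce equivalent configs per the field notes.
from LinkedInWebScraper import JobScraperConfig, TimePosted

by_enum = JobScraperConfig(position="QA Engineer", location="Guadalajara",
                           time_posted=TimePosted.DAY)
by_string = JobScraperConfig(position="QA Engineer", location="Guadalajara",
                             time_posted="DAY")  # matching string form
```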
### Runtime TOML Config

The tracked template is `runtime.example.toml`. A real `runtime.toml` can define:
```toml
[logging]
level = "INFO"
file_name = "main.log"

[storage]
file_name = "linkedin_jobs.sqlite"
state_dir = "artifacts/state"

[scrape.once]
position = "Data Scientist"
location = "Monterrey"
openai_enabled = false
openai_model = "gpt-4o-mini"
time_posted = "DAY"
remote_types = ["REMOTE", "HYBRID", "ON-SITE"]
file_name = "LinkedIn_Jobs_Data_Scientist_Monterrey.csv"
output_dir = "artifacts/jobs"
append = true

[scrape.daily]
cities = ["Monterrey", "Guadalajara", "Mexico City"]
position = "Data Scientist"
openai_enabled = false
openai_model = "gpt-4o-mini"
time_posted = "DAY"
output_dir = "artifacts/jobs"
combined_file_name = "LinkedIn_Jobs_Data_Scientist_Mexico.csv"

[export]
run_id = ""
file_name = "linkedin_jobs_export.csv"
output_dir = "artifacts/jobs"
```
### CLI Override Precedence

Runtime settings resolve in the following order, highest precedence first:
- CLI flags
- environment overrides
- TOML file values
- code defaults
Supported env overrides include the following (a resolution sketch follows the list):

- `LINKEDIN_WEB_SCRAPER_CONFIG`
- `LINKEDIN_WEB_SCRAPER_LOG_LEVEL`
- `LINKEDIN_WEB_SCRAPER_LOG_FILE`
- `LINKEDIN_WEB_SCRAPER_STORAGE_URL`
- `LINKEDIN_WEB_SCRAPER_STORAGE_FILE`
- `LINKEDIN_WEB_SCRAPER_STATE_DIR`
- `LINKEDIN_WEB_SCRAPER_OUTPUT_DIR`
- `LINKEDIN_WEB_SCRAPER_OPENAI_ENABLED`
- `LINKEDIN_WEB_SCRAPER_OPENAI_MODEL`
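The merge can be pictured with a small hypothetical resolver; this is not the actual implementation, only the documented ordering:

```python
# Hypothetical resolver illustrating the precedence chain described above.
import os

def resolve(cli_value, env_var, toml_value, default):
    """Return the first defined value: CLI flag > env override > TOML > default."""
    if cli_value is not None:
        return cli_value
    env_value = os.environ.get(env_var)
    if env_value is not None:
        return env_value
    if toml_value is not None:
        return toml_value
    return default

# With runtime.toml setting [logging].level = "INFO" and no CLI flag,
# exporting LINKEDIN_WEB_SCRAPER_LOG_LEVEL=DEBUG yields "DEBUG" here.
level = resolve(None, "LINKEDIN_WEB_SCRAPER_LOG_LEVEL", "INFO", "WARNING")
```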
### OpenAI Runtime Behavior
OpenAI support remains optional.
- Install `LinkedInWebScraper[openai]`
- Set `OPENAI_API_KEY` in the environment
- Keep secrets out of TOML and out of the repo
- Enrichment setup or request failures fall back to the non-enriched dataset
- Enriched rows include audit fields such as `OpenAIModel`, `OpenAIResponseId`, and `OpenAIRawPayload` (an opt-in sketch follows)
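A sketch of opting in programmatically, assuming `OPENAI_API_KEY` was already exported in the shell and the import path matches the package name:

```python
# Sketch: enable optional enrichment; assumes OPENAI_API_KEY is already
# exported in the shell (never store it in TOML or the repo).
from LinkedInWebScraper import JobScraperConfig  # import path assumed

config = JobScraperConfig(
    position="Data Scientist",
    location="Monterrey",
    openai_enabled=True,         # opt in to description enrichment
    openai_model="gpt-4o-mini",  # the documented default model
)
# If client setup or a request fails mid-run, the scrape falls back to the
# non-enriched dataset instead of aborting.
```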
### Artifact And State Paths

Managed defaults resolve under `artifacts/`:

- bare CSV file names -> `artifacts/jobs/`
- bare log file names -> `artifacts/logs/`
- bare SQLite file names -> `artifacts/state/`
Explicit absolute paths and explicit nested relative paths are preserved.
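A hypothetical helper can make the mapping concrete; `resolve_artifact_path` exists only for this sketch and is not part of the package:

```python
# Hypothetical helper mirroring the documented path rules; not a real API.
from pathlib import Path

def resolve_artifact_path(file_name: str) -> Path:
    name = Path(file_name)
    if name.is_absolute() or len(name.parts) > 1:
        return name  # explicit absolute or nested relative paths are preserved
    subdir = {".csv": "jobs", ".log": "logs", ".sqlite": "state"}.get(
        name.suffix, "jobs"  # fallback choice is an assumption
    )
    return Path("artifacts") / subdir / name

print(resolve_artifact_path("report.csv"))    # artifacts/jobs/report.csv
print(resolve_artifact_path("logs/run.log"))  # logs/run.log (preserved)
```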
### Storage Model

SQLite persistence is enabled by default for CLI and `DailyScrapeService` workflows.

- `scrape_runs` stores run lifecycle metadata
- `jobs` stores the latest canonical job attributes
- `job_snapshots` stores the run-level dataframe payload
- `job_enrichments` stores structured OpenAI audit data when present
- CSV exports are downstream artifacts written from persisted data
Use `build_sqlite_storage_url()` for a managed default URL, or inject `SQLiteScrapeStorage(storage_url=...)` into `DailyScrapeService` when you need a custom local path or DSN.
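A wiring sketch under the same import-path assumption; the `storage=` keyword on `DailyScrapeService` and the URL scheme are guesses based on the description above:

```python
# Sketch: managed default vs. custom storage; import paths, the keyword name,
# and the URL scheme are assumptions.
from LinkedInWebScraper import (
    DailyScrapeService,
    SQLiteScrapeStorage,
    build_sqlite_storage_url,
)

default_url = build_sqlite_storage_url()  # managed default under artifacts/state/

custom = SQLiteScrapeStorage(storage_url="sqlite:///custom/jobs.sqlite")
service = DailyScrapeService(storage=custom)  # keyword name is an assumption
```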
### Root Runtime Scripts

`main.py` and `process_ds_jobs.py` remain available as direct runtime entrypoints for the daily and once workflows.