API Reference
Application
linkedin_web_scraper.application.linkedin_job_scraper
High-level orchestration for the LinkedIn scraping pipeline.
LinkedInJobScraper
Coordinate scraping, cleaning, classification, and optional enrichment.
The class owns the end-to-end library workflow for a single scrape request: scrape search results, normalize the dataframe, optionally filter titles, fetch detail pages, and optionally enrich descriptions through OpenAI.
__init__(logger, config, *, job_scraper=None, job_data_cleaner=None, openai_handler=None)
Build a scraper pipeline with optional dependency overrides.
run()
Run the end-to-end scrape pipeline and return the resulting dataframe.
scrape_jobs()
Scrape jobs from LinkedIn using JobScraper.
clean_jobs(scraped_jobs)
Clean the scraped job data using JobDataCleaner.
classify_jobs(cleaned_jobs)
Classify job titles using JobTitleClassifier.
fetch_job_details(classified_jobs)
Fetch job-detail pages for the classified job dataframe.
clean_job_details(jobs_with_details)
Clean the extracted job details before optional enrichment.
enrich_jobs_with_descriptions(cleaned_jobs_with_details)
Enrich job data by processing job descriptions with OpenAI.
final_processing(enriched_jobs)
Perform final processing on the enriched job data.
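A minimal usage sketch based on the documented `__init__(logger, config, ...)` signature; it assumes `run()` returns a pandas dataframe, as the pipeline description implies:

```python
from linkedin_web_scraper.application.linkedin_job_scraper import LinkedInJobScraper
from linkedin_web_scraper.config.job_scraper_config_factory import JobScraperConfigFactory
from linkedin_web_scraper.infra.logging import get_logger

# Build a normalized config, then run the whole pipeline in one call:
# scrape -> clean -> classify -> details -> optional enrichment.
config = JobScraperConfigFactory.create("Data Scientist", "Monterrey")
scraper = LinkedInJobScraper(logger=get_logger(__name__), config=config)
df_jobs = scraper.run()
```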
linkedin_web_scraper.application.daily_scrape_service
DailyScrapeService
Coordinate city-level daily scrapes, persistence, and CSV exports.
run_for_location(*, position='Data Scientist', location='Monterrey', openai_enabled=False, openai_model=DEFAULT_OPENAI_MODEL, time_posted=TimePosted.DAY, remote_types=DEFAULT_REMOTE_TYPES, file_name=None, output_dir=None, append=True)
Run the configured scrape for one location across remote variants.
run_daily(*, cities=DEFAULT_DAILY_CITIES, position='Data Scientist', openai_enabled=False, openai_model=DEFAULT_OPENAI_MODEL, time_posted=TimePosted.DAY, output_dir=None, combined_file_name=None)
Run the default daily scrape across multiple cities and save a combined CSV.
format_jobs_output_name(position, location)
Build the stable CSV name used for daily city outputs.
resolve_output_path(file_name, output_dir=None)
Resolve an output file name under the managed jobs directory by default.
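A sketch of the two entrypoints. The constructor for DailyScrapeService is not documented here, so the no-argument form below is an assumption:

```python
from linkedin_web_scraper.application.daily_scrape_service import (
    DailyScrapeService,
    format_jobs_output_name,
)

service = DailyScrapeService()  # assumed no-argument constructor

# One location across its remote variants; appends to the stable per-city CSV.
service.run_for_location(position="Data Scientist", location="Monterrey")

# The default multi-city daily run, saving a combined CSV.
service.run_daily(combined_file_name="daily_combined.csv")

print(format_jobs_output_name("Data Scientist", "Monterrey"))  # stable per-city name
```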
linkedin_web_scraper.application.storage
Application-layer storage contracts for persisted scrape runs.
ScrapeRunContext
dataclass
Describe one persisted scrape run at the application boundary.
ScrapeStorage
Bases: Protocol
Protocol for run-scoped scrape persistence implementations.
begin_run(context)
Create and return a persisted run identifier.
store_jobs(run_id, df_jobs)
Persist the dataframe for a previously created scrape run.
load_run_jobs(run_id)
Load the persisted dataframe for one scrape run.
finish_run(run_id, *, status='completed', output_path=None, error_message=None, row_count=None)
Mark the scrape run as finished and optionally attach result metadata.
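Because ScrapeStorage is a Protocol, any object with structurally matching methods satisfies it. A hypothetical in-memory implementation, assuming integer run identifiers (the identifier type is not documented above):

```python
import pandas as pd

from linkedin_web_scraper.application.storage import ScrapeRunContext


class InMemoryScrapeStorage:
    """Illustrative in-memory stand-in satisfying the ScrapeStorage protocol."""

    def __init__(self) -> None:
        self._runs: dict[int, pd.DataFrame] = {}
        self._next_id = 1

    def begin_run(self, context: ScrapeRunContext) -> int:
        run_id = self._next_id
        self._next_id += 1
        self._runs[run_id] = pd.DataFrame()
        return run_id

    def store_jobs(self, run_id: int, df_jobs: pd.DataFrame) -> None:
        self._runs[run_id] = df_jobs

    def load_run_jobs(self, run_id: int) -> pd.DataFrame:
        return self._runs[run_id]

    def finish_run(self, run_id: int, *, status="completed", output_path=None,
                   error_message=None, row_count=None) -> None:
        pass  # nothing extra to persist in memory
```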
linkedin_web_scraper.application.runtime_runner
Application-level runtime helpers for CLI and scheduled executions.
RuntimeRunner
Execute runtime-configured scrape and export workflows.
build_storage_url(runtime_config)
Resolve the SQLite storage URL from runtime settings.
describe_once(runtime_config)
Describe the resolved single-location scrape plan.
describe_daily(runtime_config)
Describe the resolved daily multi-city scrape plan.
describe_export(runtime_config)
Describe the resolved CSV export plan for a persisted scrape run.
run_once(runtime_config)
Run one configured location scrape through the daily service.
run_daily(runtime_config)
Run the configured daily multi-city scrape workflow.
export_run(runtime_config)
Export a persisted scrape run to a CSV file.
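A sketch of driving the runner from a script. RuntimeRunner's constructor is undocumented above, so the no-argument form is an assumption:

```python
from linkedin_web_scraper.application.runtime_runner import RuntimeRunner
from linkedin_web_scraper.config.runtime import load_runtime_config

runtime_config = load_runtime_config()       # path resolved from the environment
runner = RuntimeRunner()                     # assumed no-argument constructor
print(runner.describe_once(runtime_config))  # inspect the plan before running
runner.run_once(runtime_config)
```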
Configuration
linkedin_web_scraper.config.job_scraper_config
JobScraperConfig
dataclass
Typed runtime configuration for a single LinkedIn scrape.
linkedin_web_scraper.config.job_scraper_advanced_config
JobScraperAdvancedConfig
dataclass
Optional advanced configuration overrides for scraping and enrichment.
from_collections(*, location_mapping=None, keywords=None, skills_categories=None)
classmethod
Build a config from generic mapping and sequence inputs.
linkedin_web_scraper.config.job_scraper_config_factory
JobScraperConfigFactory
Factory helpers for constructing normalized scraper config objects.
create(position, location, openai_enabled=False, openai_model=DEFAULT_OPENAI_MODEL, time_posted=TimePosted.DAY, remote=RemoteType.ALL, *, distance=10, advanced_config=None)
staticmethod
Build a normalized scraper configuration from user-facing inputs.
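Usage follows directly from the documented signature; note that distance is keyword-only:

```python
from linkedin_web_scraper.config.job_scraper_config_factory import JobScraperConfigFactory
from linkedin_web_scraper.config.options import RemoteType, TimePosted

config = JobScraperConfigFactory.create(
    "Data Scientist",
    "Monterrey",
    time_posted=TimePosted.DAY,
    remote=RemoteType.ALL,
    distance=25,
)
```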
linkedin_web_scraper.config.openai
OpenAI-related runtime defaults for optional enrichment.
linkedin_web_scraper.config.storage
Storage-related runtime defaults and helpers.
build_sqlite_storage_url(file_name=DEFAULT_SQLITE_DB_FILE, state_dir=None)
Build a SQLite URL rooted in the managed state directory by default.
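For example (the exact URL format is not documented here, and the state_dir argument type is assumed to accept a path string):

```python
from linkedin_web_scraper.config.storage import build_sqlite_storage_url

url = build_sqlite_storage_url()  # default DB file in the managed state dir
custom = build_sqlite_storage_url("scrapes.db", state_dir="/tmp/scraper-state")
```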
linkedin_web_scraper.config.runtime
Runtime configuration models and TOML loading helpers for CLI workflows.
LoggingRuntimeConfig
dataclass
Runtime logging settings for CLI and script entrypoints.
StorageRuntimeConfig
dataclass
Runtime storage settings for persisted scrape runs.
ScrapeOnceRuntimeConfig
dataclass
Runtime defaults for a single-location scrape command.
ScrapeDailyRuntimeConfig
dataclass
Runtime defaults for the multi-city daily scrape command.
ExportRuntimeConfig
dataclass
Runtime defaults for exporting persisted run data to CSV.
RuntimeConfig
dataclass
Top-level runtime configuration for CLI and scheduled runs.
runtime_config_from_mapping(data)
Build a runtime config object from TOML-compatible nested mappings.
apply_environment_overrides(config, environ=None)
Apply lightweight environment overrides on top of a loaded runtime config.
resolve_runtime_config_path(path=None, *, environ=None)
Resolve the runtime config path from an explicit value or environment.
load_runtime_config(path=None, *, environ=None)
Load runtime config from TOML and apply supported environment overrides.
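A sketch of the loading helpers. The nested section and key names below merely mirror the dataclass names; they are assumptions, not the real TOML schema:

```python
from linkedin_web_scraper.config.runtime import (
    apply_environment_overrides,
    load_runtime_config,
    runtime_config_from_mapping,
)

# From a TOML file (path from the argument, or resolved from the environment).
config = load_runtime_config("runtime.toml")

# Or from an in-memory mapping; key names here are illustrative only.
config = runtime_config_from_mapping({
    "logging": {"level": "INFO"},
    "scrape_once": {"position": "Data Scientist", "location": "Monterrey"},
})
config = apply_environment_overrides(config)
```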
linkedin_web_scraper.config.options
TimePosted
Bases: StrEnum
Supported LinkedIn time-posted filter values.
RemoteType
Bases: StrEnum
Supported LinkedIn remote-work filter values.
Domain
linkedin_web_scraper.domain.job_data_cleaner
Dataframe normalization helpers for raw and enriched LinkedIn job data.
JobDataCleaner
Clean and normalize raw LinkedIn job dataframes.
clean_jobs_dataframe(df, location_mapping)
Clean raw scrape output into a normalized jobs dataframe.
process_location_data(df, location_mapping)
Clean the Location column and apply location-specific transformations.
process_urls_and_job_ids(df)
Truncate job URLs and extract JobIDs from them.
filter_valid_job_ids(df)
Remove rows whose JobIDs are not exactly 10 digits.
remove_duplicate_job_ids(df)
Find and remove duplicate JobIDs.
remove_duplicates_by_columns(df)
Remove duplicates based on Location, Title, and Company.
clean_extracted_job_data(df_jobs)
Clean and normalize extracted job details.
clean_num_applicants(df_jobs)
Clean and standardize the number of applicants.
clean_seniority_level(df_jobs)
Clean up the 'SeniorityLevel' column.
standardize_employment_type(df_jobs)
Standardize 'EmploymentType' as a categorical variable.
standardize_job_function(df_jobs)
Standardize the 'JobFunction' column.
split_job_functions(df_jobs)
Split 'JobFunction' into three separate columns.
convert_posted_time(df_jobs)
Convert 'PostedTime' into dates.
reorder_columns(df_jobs)
Reorder the dataframe columns into a consistent presentation order.
process_enriched_job_data(df_jobs, tech_stack_categories=None)
Post-process enriched job data after description enrichment.
extract_min_years(df_jobs)
Extract the minimum number of years from the experience string.
categorize_studies(df_jobs)
Categorize the minimum level of studies.
categorize_tech_stack(df_jobs, tech_stack_categories)
Categorize tech stack values into predefined groups.
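A sketch of the raw-cleaning entrypoint. The toy column names are inferred from the method docs above (Location/Title/Company, plus a URL the JobID is parsed from) and may not match the real schema exactly; the no-argument constructor is also an assumption:

```python
import pandas as pd

from linkedin_web_scraper.domain.job_data_cleaner import JobDataCleaner

df_raw = pd.DataFrame({
    "Title": ["Data Scientist"],
    "Company": ["Acme"],
    "Location": ["Monterrey, Nuevo León, Mexico"],
    "URL": ["https://www.linkedin.com/jobs/view/1234567890/?refId=abc"],
})
cleaner = JobDataCleaner()  # assumed no-argument constructor
df_clean = cleaner.clean_jobs_dataframe(
    df_raw, {"Monterrey, Nuevo León, Mexico": "Monterrey"}
)
```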
linkedin_web_scraper.domain.job_title_classifier
Title-filtering helpers for narrowing results to relevant jobs.
JobTitleClassifier
Filter scraped jobs to titles related to the requested position.
classify_title(df_jobs)
Classify job titles based on keywords and filter out unrelated jobs.
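A sketch, assuming a no-argument constructor (the keyword source is undocumented here; in the real pipeline it presumably comes from the scraper config):

```python
from linkedin_web_scraper.domain.job_title_classifier import JobTitleClassifier

classifier = JobTitleClassifier()                  # keyword config assumed internal
df_relevant = classifier.classify_title(df_clean)  # df_clean from the cleaner example
```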
Infrastructure
linkedin_web_scraper.infra.logging
SupportsLogAttribute
Bases: Protocol
Protocol for legacy logger wrappers that expose a .log logger.
Logger
Backward-compatible logger facade that configures package logging once.
parse_log_level(level)
Normalize string and integer log levels to logging constants.
get_logger(name=None)
Return a package logger or a named descendant logger.
resolve_logger(logger=None, *, name=None)
Resolve stdlib loggers and legacy wrappers to a standard logger instance.
configure_logging(filename=None, *, level=logging.INFO, stream=None, logger_name=PACKAGE_LOGGER_NAME, force=True, format_string=DEFAULT_LOG_FORMAT)
Configure package logging for scripts and CLI entrypoints.
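A typical entrypoint setup using the documented helpers:

```python
from linkedin_web_scraper.infra.logging import (
    configure_logging,
    get_logger,
    parse_log_level,
)

configure_logging(level=parse_log_level("DEBUG"))  # one-time setup in an entrypoint
logger = get_logger(__name__)                      # descendant of the package logger
logger.info("scrape starting")
```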
linkedin_web_scraper.infra.paths
resolve_jobs_output_path(file_name, output_dir=None)
Resolve a jobs CSV path under the managed jobs artifact directory by default.
resolve_log_path(file_name, log_dir=None)
Resolve a log file path under the managed logs artifact directory by default.
resolve_state_path(file_name, state_dir=None)
Resolve a state or database path under the managed state directory by default.
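Each helper falls back to its managed artifact directory when no explicit directory is given; the file names here are illustrative:

```python
from linkedin_web_scraper.infra.paths import (
    resolve_jobs_output_path,
    resolve_log_path,
    resolve_state_path,
)

csv_path = resolve_jobs_output_path("jobs_data_scientist_monterrey.csv")
log_path = resolve_log_path("scrape.log")
db_path = resolve_state_path("scrapes.db")
```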
HTTP
linkedin_web_scraper.infra.http.policy
HttpRequestPolicy
dataclass
Retry, timeout, and header policy for HTTP scraping requests.
with_overrides(*, timeout=None, max_retries=None, initial_backoff=None, max_backoff=None, retryable_status_codes=None)
Return a copy of the policy with selected fields replaced.
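A sketch of deriving a variant policy; it assumes the dataclass provides defaults for every field, which is not documented above:

```python
from linkedin_web_scraper.infra.http.policy import HttpRequestPolicy

base = HttpRequestPolicy()  # assumed default-constructible
patient = base.with_overrides(timeout=30, max_retries=5)
```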
linkedin_web_scraper.infra.http.utils
Low-level HTTP request helpers used by the scraper layer.
get_random_header(headers=None)
Return a random user-agent header from the configured list.
fetch_until_success(url, logger=None, max_retries=None, backoff_time=None, *, session=None, timeout=None, policy=None, sleep=None)
Attempt to fetch a URL until success or the retry budget is exhausted.
Returns the response on HTTP 200. Returns None for non-retryable
responses, terminal request failures, or when the retry budget is exhausted.
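For example, presumably returning a requests-style response object on success:

```python
from linkedin_web_scraper.infra.http.utils import fetch_until_success

response = fetch_until_success(
    "https://www.linkedin.com/jobs/search?keywords=data+scientist",
    max_retries=3,
    timeout=10,
)
if response is None:
    print("non-retryable response or retry budget exhausted")
```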
linkedin_web_scraper.infra.http.job_scraper
HTTP-backed LinkedIn result-page and detail-page scraping helpers.
JobScraper
HTTP-facing scraper for LinkedIn job result pages and detail pages.
__init__(config, logger=None, *, session=None, request_timeout=None, request_policy=None)
Initialize the scraper with config, logging, and HTTP dependencies.
scrape_jobs()
Scrape jobs from LinkedIn across multiple pages.
fetch_total_jobs()
Fetch and return the total number of jobs available for the search criteria.
generate_main_url()
Generate the main LinkedIn job search URL with the specified filters.
generate_paginated_url(start)
Generate the paginated URL for fetching jobs from LinkedIn.
parse_job_data(html_content)
Parse the job data from the HTML content and add it to the jobs list.
extract_job_info(job)
Extract job information from a single job listing.
fetch_job_details(df_jobs)
Fetch detailed job information for each job posting.
get_jobid_information(jobid)
Generate the URL to fetch detailed job posting data based on job ID.
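A sketch of the result-page and detail-page flow, using the documented `__init__(config, logger=None, ...)` signature:

```python
from linkedin_web_scraper.config.job_scraper_config_factory import JobScraperConfigFactory
from linkedin_web_scraper.infra.http.job_scraper import JobScraper

config = JobScraperConfigFactory.create("Data Scientist", "Monterrey")
scraper = JobScraper(config, request_timeout=15)
print(scraper.generate_main_url())            # inspect the search URL first
df_jobs = scraper.scrape_jobs()               # paginated result pages
df_jobs = scraper.fetch_job_details(df_jobs)  # one detail page per posting
```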
OpenAI
linkedin_web_scraper.infra.openai.models
Typed models and protocols for optional OpenAI enrichment.
OpenAIEnrichmentConfig
dataclass
Runtime configuration for OpenAI-backed enrichment.
JobDescriptionEnrichment
dataclass
Structured job-description enrichment output.
tech_stack_text
property
Return the tech stack as a comma-separated string for dataframe storage.
english_requirement_text
property
Return a dataframe-friendly English requirement value.
to_legacy_dict()
Convert the structured result into the legacy JSON-compatible shape.
JobDescriptionEnricher
Bases: Protocol
Protocol for adapters that can enrich a raw job description.
extract_job_description(description)
Return a structured enrichment result for a single job description.
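Any object with a structurally matching extract_job_description method satisfies the protocol. A hypothetical stub showing only the shape (JobDescriptionEnrichment's fields are not listed above):

```python
from linkedin_web_scraper.infra.openai.models import JobDescriptionEnrichment


class OfflineEnricher:
    """Hypothetical non-OpenAI enricher satisfying JobDescriptionEnricher."""

    def extract_job_description(self, description: str) -> JobDescriptionEnrichment:
        # A real implementation would populate the enrichment fields
        # from `description`; they are undocumented here.
        raise NotImplementedError
```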
linkedin_web_scraper.infra.openai.openai_handler
Optional OpenAI adapter built on the Responses API.
OpenAIConfigurationError
Bases: RuntimeError
Raised when OpenAI enrichment is requested without valid configuration.
OpenAIDependencyError
Bases: RuntimeError
Raised when optional OpenAI enrichment dependencies are missing.
OpenAIHandler
Handle optional OpenAI job-description enrichment using structured parsing.
create_messages(description)
Create a compatibility prompt payload for a raw job description.
extract_job_description(description)
Return structured enrichment data for a single job description.
generate_chat_completion(messages)
Compatibility wrapper that returns the historical JSON-compatible shape.
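A guarded-usage sketch. The constructor arguments are undocumented above; an enrichment config and API key are presumably required, so the no-argument call is illustrative only:

```python
from linkedin_web_scraper.infra.openai.openai_handler import (
    OpenAIConfigurationError,
    OpenAIDependencyError,
    OpenAIHandler,
)

try:
    handler = OpenAIHandler()  # real construction likely needs config/credentials
    result = handler.extract_job_description("3+ years of Python, SQL required...")
except (OpenAIConfigurationError, OpenAIDependencyError) as exc:
    print(f"enrichment unavailable: {exc}")
```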
linkedin_web_scraper.infra.openai.job_description_processor
Helpers for applying structured OpenAI enrichment to job dataframes.
JobDescriptionProcessor
Use a structured enricher to enrich scraped job descriptions.
process_job_descriptions(df_jobs)
Process job descriptions and append parsed fields without failing the scrape.
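A sketch, assuming the processor is constructed around any JobDescriptionEnricher (`handler` and `df_jobs` come from the examples above):

```python
from linkedin_web_scraper.infra.openai.job_description_processor import (
    JobDescriptionProcessor,
)

processor = JobDescriptionProcessor(handler)  # constructor shape assumed
df_enriched = processor.process_job_descriptions(df_jobs)  # row failures don't abort
```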
Storage
linkedin_web_scraper.infra.storage.models
SQLAlchemy models for persisted scrape runs and job snapshots.
Base
Bases: DeclarativeBase
Base declarative model for the SQLite scrape storage schema.
ScrapeRunRecord
JobRecord
JobSnapshotRecord
JobEnrichmentRecord
utcnow()
Return a timezone-aware UTC timestamp for persisted records.
linkedin_web_scraper.infra.storage.sqlite
SQLite-backed storage adapter for persisted scrape runs.
SQLiteScrapeStorage
Bases: ScrapeStorage
Persist scrape runs, job snapshots, and enrichments to SQLite.
begin_run(context)
Create a persisted scrape run and return its run identifier.
store_jobs(run_id, df_jobs)
Persist the dataframe rows for a scrape run.
load_run_jobs(run_id)
Load the persisted dataframe rows for one scrape run.
finish_run(run_id, *, status='completed', output_path=None, error_message=None, row_count=None)
Mark a persisted scrape run as finished.
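A run-lifecycle sketch. Both the storage constructor signature and ScrapeRunContext's fields are undocumented above, so those two calls are illustrative:

```python
import pandas as pd

from linkedin_web_scraper.application.storage import ScrapeRunContext
from linkedin_web_scraper.config.storage import build_sqlite_storage_url
from linkedin_web_scraper.infra.storage.sqlite import SQLiteScrapeStorage

storage = SQLiteScrapeStorage(build_sqlite_storage_url())  # assumed URL argument
context = ScrapeRunContext()                               # real fields required here

df_jobs = pd.DataFrame({"Title": ["Data Scientist"], "Company": ["Acme"]})
run_id = storage.begin_run(context)
storage.store_jobs(run_id, df_jobs)
print(len(storage.load_run_jobs(run_id)))
storage.finish_run(run_id, status="completed", row_count=len(df_jobs))
```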