API Reference

Application

linkedin_web_scraper.application.linkedin_job_scraper

High-level orchestration for the LinkedIn scraping pipeline.

LinkedInJobScraper

Coordinate scraping, cleaning, classification, and optional enrichment.

The class owns the end-to-end library workflow for a single scrape request: scrape search results, normalize the dataframe, optionally filter titles, fetch detail pages, and optionally enrich descriptions through OpenAI.

__init__(logger, config, *, job_scraper=None, job_data_cleaner=None, openai_handler=None)

Build a scraper pipeline with optional dependency overrides.

run()

Run the end-to-end scrape pipeline and return the resulting dataframe.

scrape_jobs()

Scrape jobs from LinkedIn using JobScraper.

clean_jobs(scraped_jobs)

Clean the scraped job data using JobDataCleaner.

classify_jobs(cleaned_jobs)

Classify job titles using JobTitleClassifier.

fetch_job_details(classified_jobs)

Fetch job-detail pages for the classified job dataframe.

clean_job_details(jobs_with_details)

Clean the extracted job details before optional enrichment.

enrich_jobs_with_descriptions(cleaned_jobs_with_details)

Enrich job data by processing job descriptions with OpenAI.

final_processing(enriched_jobs)

Perform final processing on the enriched job data.
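
A minimal usage sketch of the pipeline. The import paths follow the module headings on this page and the signatures follow the entries above; the position and location values are illustrative.

```python
from linkedin_web_scraper.application.linkedin_job_scraper import LinkedInJobScraper
from linkedin_web_scraper.config.job_scraper_config_factory import JobScraperConfigFactory
from linkedin_web_scraper.infra.logging import get_logger

# Build a normalized config, wire it into the pipeline, and run end to end.
config = JobScraperConfigFactory.create("Data Scientist", "Monterrey")
scraper = LinkedInJobScraper(get_logger(__name__), config)
df_jobs = scraper.run()  # scrape -> clean -> classify -> details -> optional enrichment
```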

linkedin_web_scraper.application.daily_scrape_service

DailyScrapeService

Coordinate city-level daily scrapes, persistence, and CSV exports.

run_for_location(*, position='Data Scientist', location='Monterrey', openai_enabled=False, openai_model=DEFAULT_OPENAI_MODEL, time_posted=TimePosted.DAY, remote_types=DEFAULT_REMOTE_TYPES, file_name=None, output_dir=None, append=True)

Run the configured scrape for one location across remote variants.

run_daily(*, cities=DEFAULT_DAILY_CITIES, position='Data Scientist', openai_enabled=False, openai_model=DEFAULT_OPENAI_MODEL, time_posted=TimePosted.DAY, output_dir=None, combined_file_name=None)

Run the default daily scrape across multiple cities and save a combined CSV.

format_jobs_output_name(position, location)

Build the stable CSV name used for daily city outputs.

resolve_output_path(file_name, output_dir=None)

Resolve an output file name under the managed jobs directory by default.
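
A hedged sketch of both entry points. The constructor is not documented on this page, so the default construction below is an assumption; the keyword arguments come from the signatures above.

```python
from linkedin_web_scraper.application.daily_scrape_service import DailyScrapeService
from linkedin_web_scraper.config.options import TimePosted

service = DailyScrapeService()  # assumption: default construction

# One location across remote variants, appended to the stable per-city CSV.
service.run_for_location(
    position="Data Scientist",
    location="Monterrey",
    time_posted=TimePosted.DAY,
    append=True,
)

# The default multi-city daily run, saved as a combined CSV.
service.run_daily(position="Data Scientist")
```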

linkedin_web_scraper.application.storage

Application-layer storage contracts for persisted scrape runs.

ScrapeRunContext dataclass

Describe one persisted scrape run at the application boundary.

ScrapeStorage

Bases: Protocol

Protocol for run-scoped scrape persistence implementations.

begin_run(context)

Create and return a persisted run identifier.

store_jobs(run_id, df_jobs)

Persist the dataframe for a previously created scrape run.

load_run_jobs(run_id)

Load the persisted dataframe for one scrape run.

finish_run(run_id, *, status='completed', output_path=None, error_message=None, row_count=None)

Mark the scrape run as finished and optionally attach result metadata.
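
Because ScrapeStorage is a Protocol, any object with matching methods satisfies it. A minimal in-memory sketch, useful as a test double; the integer run identifiers and the internal dict layout are assumptions.

```python
import pandas as pd


class InMemoryScrapeStorage:
    """Test double satisfying the ScrapeStorage protocol (run ids assumed int)."""

    def __init__(self) -> None:
        self._runs: dict[int, dict] = {}
        self._next_id = 1

    def begin_run(self, context) -> int:
        run_id, self._next_id = self._next_id, self._next_id + 1
        self._runs[run_id] = {"context": context, "jobs": None, "status": "running"}
        return run_id

    def store_jobs(self, run_id: int, df_jobs: pd.DataFrame) -> None:
        self._runs[run_id]["jobs"] = df_jobs.copy()

    def load_run_jobs(self, run_id: int) -> pd.DataFrame:
        return self._runs[run_id]["jobs"]

    def finish_run(self, run_id: int, *, status="completed", output_path=None,
                   error_message=None, row_count=None) -> None:
        self._runs[run_id].update(status=status, output_path=output_path,
                                  error_message=error_message, row_count=row_count)
```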

linkedin_web_scraper.application.runtime_runner

Application-level runtime helpers for CLI and scheduled executions.

RuntimeRunner

Execute runtime-configured scrape and export workflows.

build_storage_url(runtime_config)

Resolve the SQLite storage URL from runtime settings.

describe_once(runtime_config)

Describe the resolved single-location scrape plan.

describe_daily(runtime_config)

Describe the resolved daily multi-city scrape plan.

describe_export(runtime_config)

Describe the resolved CSV export plan for a persisted scrape run.

run_once(runtime_config)

Run one configured location scrape through the daily service.

run_daily(runtime_config)

Run the configured daily multi-city scrape workflow.

export_run(runtime_config)

Export a persisted scrape run to a CSV file.
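
A sketch of the describe/run split, which lets callers preview a resolved plan before executing it. The RuntimeRunner constructor is not documented here, so the default construction is an assumption.

```python
from linkedin_web_scraper.application.runtime_runner import RuntimeRunner
from linkedin_web_scraper.config.runtime import load_runtime_config

runtime_config = load_runtime_config()       # TOML plus environment overrides
runner = RuntimeRunner()                     # assumption: default construction

print(runner.describe_once(runtime_config))  # inspect the resolved plan first
runner.run_once(runtime_config)              # then execute it
```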

Configuration

linkedin_web_scraper.config.job_scraper_config

JobScraperConfig dataclass

Typed runtime configuration for a single LinkedIn scrape.

linkedin_web_scraper.config.job_scraper_advanced_config

JobScraperAdvancedConfig dataclass

Optional advanced configuration overrides for scraping and enrichment.

from_collections(*, location_mapping=None, keywords=None, skills_categories=None) classmethod

Build a config from generic mapping and sequence inputs.

linkedin_web_scraper.config.job_scraper_config_factory

JobScraperConfigFactory

Factory helpers for constructing normalized scraper config objects.

create(position, location, openai_enabled=False, openai_model=DEFAULT_OPENAI_MODEL, time_posted=TimePosted.DAY, remote=RemoteType.ALL, *, distance=10, advanced_config=None) staticmethod

Build a normalized scraper configuration from user-facing inputs.
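
A sketch combining the advanced config with the factory. Both signatures are documented above; the mapping and keyword contents are illustrative.

```python
from linkedin_web_scraper.config.job_scraper_advanced_config import JobScraperAdvancedConfig
from linkedin_web_scraper.config.job_scraper_config_factory import JobScraperConfigFactory

# Optional overrides built from plain collections.
advanced = JobScraperAdvancedConfig.from_collections(
    location_mapping={"MTY": "Monterrey"},
    keywords=["data", "machine learning"],
)

config = JobScraperConfigFactory.create(
    "Data Engineer",
    "Guadalajara",
    distance=25,
    advanced_config=advanced,
)
```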

linkedin_web_scraper.config.openai

OpenAI-related runtime defaults for optional enrichment.

linkedin_web_scraper.config.storage

Storage-related runtime defaults and helpers.

build_sqlite_storage_url(file_name=DEFAULT_SQLITE_DB_FILE, state_dir=None)

Build a SQLite URL rooted in the managed state directory by default.
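
A short sketch of the URL helper; the file name and directory are illustrative.

```python
from linkedin_web_scraper.config.storage import build_sqlite_storage_url

default_url = build_sqlite_storage_url()  # default DB file under the managed state dir
test_url = build_sqlite_storage_url("test.db", state_dir="/tmp/scraper-state")
```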

linkedin_web_scraper.config.runtime

Runtime configuration models and TOML loading helpers for CLI workflows.

LoggingRuntimeConfig dataclass

Runtime logging settings for CLI and script entrypoints.

StorageRuntimeConfig dataclass

Runtime storage settings for persisted scrape runs.

ScrapeOnceRuntimeConfig dataclass

Runtime defaults for a single-location scrape command.

ScrapeDailyRuntimeConfig dataclass

Runtime defaults for the multi-city daily scrape command.

ExportRuntimeConfig dataclass

Runtime defaults for exporting persisted run data to CSV.

RuntimeConfig dataclass

Top-level runtime configuration for CLI and scheduled runs.

runtime_config_from_mapping(data)

Build a runtime config object from TOML-compatible nested mappings.

apply_environment_overrides(config, environ=None)

Apply lightweight environment overrides on top of a loaded runtime config.

resolve_runtime_config_path(path=None, *, environ=None)

Resolve the runtime config path from an explicit value or environment.

load_runtime_config(path=None, *, environ=None)

Load runtime config from TOML and apply supported environment overrides.
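
A sketch of the load path; the TOML file name is illustrative.

```python
from linkedin_web_scraper.config.runtime import (
    load_runtime_config,
    resolve_runtime_config_path,
)

# An explicit path wins over the environment-resolved one.
path = resolve_runtime_config_path("config/runtime.toml")
runtime_config = load_runtime_config(path)  # parse TOML, then apply env overrides
```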

linkedin_web_scraper.config.options

TimePosted

Bases: StrEnum

Supported LinkedIn time-posted filter values.

RemoteType

Bases: StrEnum

Supported LinkedIn remote-work filter values.
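
Both enums are StrEnums, so members behave as plain strings and can be passed straight into query parameters or config values. Only TimePosted.DAY and RemoteType.ALL appear in the signatures on this page; other member names would need checking against the source.

```python
from linkedin_web_scraper.config.options import RemoteType, TimePosted

assert isinstance(TimePosted.DAY, str)  # StrEnum members are real strings
params = {"time_posted": TimePosted.DAY, "remote": RemoteType.ALL}
```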

Domain

linkedin_web_scraper.domain.job_data_cleaner

Dataframe normalization helpers for raw and enriched LinkedIn job data.

JobDataCleaner

Clean and normalize raw LinkedIn job data frames.

clean_jobs_dataframe(df, location_mapping)

Clean raw scrape output into a normalized jobs dataframe.

process_location_data(df, location_mapping)

Clean the Location column and apply location-specific transformations.

process_urls_and_job_ids(df)

Truncate URLs and extract JobIDs from them.

filter_valid_job_ids(df)

Remove rows with invalid JobIDs (not 10 digits).

remove_duplicate_job_ids(df)

Find and remove duplicate JobIDs.

remove_duplicates_by_columns(df)

Remove duplicates based on Location, Title, and Company.

clean_extracted_job_data(df_jobs)

Clean and normalize extracted job details.

clean_num_applicants(df_jobs)

Clean and standardize the number of applicants.

clean_seniority_level(df_jobs)

Clean up the 'SeniorityLevel' column.

standardize_employment_type(df_jobs)

Standardize 'EmploymentType' as a categorical variable.

standardize_job_function(df_jobs)

Standardize the 'JobFunction' column.

split_job_functions(df_jobs)

Split 'JobFunction' into three separate columns.

convert_posted_time(df_jobs)

Convert 'PostedTime' into dates.

reorder_columns(df_jobs)

Reorder the columns in the dataframe for better organization.

process_enriched_job_data(df_jobs, tech_stack_categories=None)

Post-process enriched job data after description enrichment.

extract_min_years(df_jobs)

Extract the minimum number of years from the experience string.

categorize_studies(df_jobs)

Categorize the minimum level of studies.

categorize_tech_stack(df_jobs, tech_stack_categories)

Categorize tech stack values into predefined groups.
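
A sketch of the entry-point cleaning call. The column names follow the method summaries above (Location, Title, Company, URL); the cleaner's default construction and the row contents are assumptions.

```python
import pandas as pd

from linkedin_web_scraper.domain.job_data_cleaner import JobDataCleaner

cleaner = JobDataCleaner()  # assumption: default construction
df_raw = pd.DataFrame({
    "Title": ["Data Scientist"],
    "Company": ["Acme"],
    "Location": ["MTY"],
    "URL": ["https://www.linkedin.com/jobs/view/1234567890/?refId=abc"],
})
df_clean = cleaner.clean_jobs_dataframe(df_raw, location_mapping={"MTY": "Monterrey"})
```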

linkedin_web_scraper.domain.job_title_classifier

Title-filtering helpers for narrowing results to relevant jobs.

JobTitleClassifier

Filter scraped jobs to titles related to the requested position.

classify_title(df_jobs)

Classify job titles based on keywords and filter out unrelated jobs.
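
A sketch continuing from the cleaned dataframe above; the classifier's bare construction is an assumption, since its keyword configuration is not documented here.

```python
from linkedin_web_scraper.domain.job_title_classifier import JobTitleClassifier

classifier = JobTitleClassifier()  # assumption: keyword configuration may be required
df_relevant = classifier.classify_title(df_clean)  # drops unrelated titles
```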

Infrastructure

linkedin_web_scraper.infra.logging

SupportsLogAttribute

Bases: Protocol

Protocol for legacy logger wrappers that expose a .log logger.

Logger

Backward-compatible logger facade that configures package logging once.

parse_log_level(level)

Normalize string and integer log levels to logging constants.

get_logger(name=None)

Return a package logger or a named descendant logger.

resolve_logger(logger=None, *, name=None)

Resolve stdlib loggers and legacy wrappers to a standard logger instance.

configure_logging(filename=None, *, level=logging.INFO, stream=None, logger_name=PACKAGE_LOGGER_NAME, force=True, format_string=DEFAULT_LOG_FORMAT)

Configure package logging for scripts and CLI entrypoints.
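
A sketch of the typical script setup using the documented helpers; the file and logger names are illustrative.

```python
from linkedin_web_scraper.infra.logging import (
    configure_logging,
    get_logger,
    parse_log_level,
)

configure_logging("daily_scrape.log", level=parse_log_level("DEBUG"))
log = get_logger("daily")  # a named descendant of the package logger
log.info("logging configured")
```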

linkedin_web_scraper.infra.paths

resolve_jobs_output_path(file_name, output_dir=None)

Resolve a jobs CSV path under the managed jobs artifact directory by default.

resolve_log_path(file_name, log_dir=None)

Resolve a log file path under the managed logs artifact directory by default.

resolve_state_path(file_name, state_dir=None)

Resolve a state or database path under the managed state directory by default.
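
A sketch of the three resolvers; the file names are illustrative.

```python
from linkedin_web_scraper.infra.paths import (
    resolve_jobs_output_path,
    resolve_log_path,
    resolve_state_path,
)

csv_path = resolve_jobs_output_path("data_scientist_monterrey.csv")
log_path = resolve_log_path("daily_scrape.log")
db_path = resolve_state_path("scrapes.sqlite3")
```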

HTTP

linkedin_web_scraper.infra.http.policy

HttpRequestPolicy dataclass

Retry, timeout, and header policy for HTTP scraping requests.

with_overrides(*, timeout=None, max_retries=None, initial_backoff=None, max_backoff=None, retryable_status_codes=None)

Return a copy of the policy with selected fields replaced.
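
A sketch of the copy-with-overrides pattern; that the dataclass provides usable defaults is an assumption.

```python
from linkedin_web_scraper.infra.http.policy import HttpRequestPolicy

base = HttpRequestPolicy()  # assumption: dataclass defaults exist
patient = base.with_overrides(timeout=30, max_retries=5)
# with_overrides returns a modified copy; `base` itself is unchanged.
```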

linkedin_web_scraper.infra.http.utils

Low-level HTTP request helpers used by the scraper layer.

get_random_header(headers=None)

Return a random user-agent header from the configured list.

fetch_until_success(url, logger=None, max_retries=None, backoff_time=None, *, session=None, timeout=None, policy=None, sleep=None)

Attempt to fetch a URL until success or the retry budget is exhausted.

Returns the response on HTTP 200. Returns None for non-retryable responses, terminal request failures, or when the retry budget is exhausted.
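
A sketch of the retry helper, assuming a requests-compatible session; the URL is illustrative.

```python
import requests

from linkedin_web_scraper.infra.http.utils import fetch_until_success

session = requests.Session()
response = fetch_until_success(
    "https://www.linkedin.com/jobs/search/", session=session, timeout=10
)
if response is not None:  # None: non-retryable status, terminal failure, or budget spent
    html = response.text
```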

linkedin_web_scraper.infra.http.job_scraper

HTTP-backed LinkedIn result-page and detail-page scraping helpers.

JobScraper

HTTP-facing scraper for LinkedIn job result pages and detail pages.

__init__(config, logger=None, *, session=None, request_timeout=None, request_policy=None)

Initialize the scraper with config, logging, and HTTP dependencies.

scrape_jobs()

Scrape jobs from LinkedIn across multiple pages.

fetch_total_jobs()

Fetch and return the total number of jobs available for the search criteria.

generate_main_url()

Generate the main LinkedIn job search URL with the specified filters.

generate_paginated_url(start)

Generate the paginated URL for fetching jobs from LinkedIn.

parse_job_data(html_content)

Parse the job data from the HTML content and add it to the jobs list.

extract_job_info(job)

Extract job information from a single job listing.

fetch_job_details(df_jobs)

Fetch detailed job information for each job posting.

get_jobid_information(jobid)

Generate the URL to fetch detailed job posting data based on job ID.
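
A sketch of driving the HTTP scraper directly, below the application layer; the signatures follow the entries above.

```python
from linkedin_web_scraper.config.job_scraper_config_factory import JobScraperConfigFactory
from linkedin_web_scraper.infra.http.job_scraper import JobScraper

config = JobScraperConfigFactory.create("Data Scientist", "Monterrey")
scraper = JobScraper(config, request_timeout=15)

print(scraper.fetch_total_jobs())             # total results for the search criteria
df_jobs = scraper.scrape_jobs()               # paginated result-page scrape
df_full = scraper.fetch_job_details(df_jobs)  # per-job detail pages
```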

OpenAI

linkedin_web_scraper.infra.openai.models

Typed models and protocols for optional OpenAI enrichment.

OpenAIEnrichmentConfig dataclass

Runtime configuration for OpenAI-backed enrichment.

JobDescriptionEnrichment dataclass

Structured job-description enrichment output.

tech_stack_text property

Return the tech stack as a comma-separated string for dataframe storage.

english_requirement_text property

Return a dataframe-friendly English requirement value.

to_legacy_dict()

Convert the structured result into the legacy JSON-compatible shape.

JobDescriptionEnricher

Bases: Protocol

Protocol for adapters that can enrich a raw job description.

extract_job_description(description)

Return a structured enrichment result for a single job description.
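
Because JobDescriptionEnricher is a Protocol, a test double only needs a matching extract_job_description method. A minimal stub sketch; the real JobDescriptionEnrichment dataclass carries more structure than the stand-in below.

```python
from dataclasses import dataclass


@dataclass
class FakeEnrichment:
    """Stand-in for JobDescriptionEnrichment; the real fields are richer."""

    payload: dict

    def to_legacy_dict(self) -> dict:
        return self.payload


class FakeEnricher:
    """Test double satisfying the JobDescriptionEnricher protocol."""

    def extract_job_description(self, description: str) -> FakeEnrichment:
        return FakeEnrichment({"description_length": len(description)})
```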

linkedin_web_scraper.infra.openai.openai_handler

Optional OpenAI adapter built on the Responses API.

OpenAIConfigurationError

Bases: RuntimeError

Raised when OpenAI enrichment is requested without valid configuration.

OpenAIDependencyError

Bases: RuntimeError

Raised when optional OpenAI enrichment dependencies are missing.

OpenAIHandler

Handle optional OpenAI job-description enrichment using structured parsing.

create_messages(description)

Create a compatibility prompt payload for a raw job description.

extract_job_description(description)

Return structured enrichment data for a single job description.

generate_chat_completion(messages)

Compatibility wrapper that returns the historical JSON-compatible shape.
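
A hedged sketch of calling the handler with the documented error types; the constructor arguments are not shown on this page, so the bare construction is an assumption.

```python
from linkedin_web_scraper.infra.openai.openai_handler import (
    OpenAIConfigurationError,
    OpenAIDependencyError,
    OpenAIHandler,
)

try:
    handler = OpenAIHandler()  # assumption: configuration arguments omitted here
    result = handler.extract_job_description("Requires 3+ years of Python and SQL.")
except OpenAIDependencyError:
    print("install the optional OpenAI dependencies to enable enrichment")
except OpenAIConfigurationError:
    print("provide a valid OpenAI configuration (e.g. an API key) first")
```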

linkedin_web_scraper.infra.openai.job_description_processor

Helpers for applying structured OpenAI enrichment to job dataframes.

JobDescriptionProcessor

Use a structured enricher to enrich scraped job descriptions.

process_job_descriptions(df_jobs)

Process job descriptions and append parsed fields without failing the scrape.
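
A sketch reusing the FakeEnricher stub from the models section above. That the processor accepts an enricher at construction, and the Description column name, are assumptions.

```python
import pandas as pd

from linkedin_web_scraper.infra.openai.job_description_processor import JobDescriptionProcessor

df_jobs = pd.DataFrame({"Description": ["Requires 3+ years of Python and SQL."]})
processor = JobDescriptionProcessor(FakeEnricher())  # assumption: takes an enricher
df_enriched = processor.process_job_descriptions(df_jobs)
# Per the summary above, enrichment failures are absorbed rather than
# aborting the scrape.
```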

Storage

linkedin_web_scraper.infra.storage.models

SQLAlchemy models for persisted scrape runs and job snapshots.

Base

Bases: DeclarativeBase

Base declarative model for the SQLite scrape storage schema.

ScrapeRunRecord

Bases: Base

Persist one application-level scrape run.

JobRecord

Bases: Base

Persist the latest known canonical attributes for one job ID.

JobSnapshotRecord

Bases: Base

Persist one job row as it appeared in a specific scrape run.

JobEnrichmentRecord

Bases: Base

Persist the structured OpenAI enrichment values for one run and job.

utcnow()

Return a timezone-aware UTC timestamp for persisted records.

linkedin_web_scraper.infra.storage.sqlite

SQLite-backed storage adapter for persisted scrape runs.

SQLiteScrapeStorage

Bases: ScrapeStorage

Persist scrape runs, job snapshots, and enrichments to SQLite.

begin_run(context)

Create a persisted scrape run and return its run identifier.

store_jobs(run_id, df_jobs)

Persist the dataframe rows for a scrape run.

load_run_jobs(run_id)

Load the persisted dataframe rows for one scrape run.

finish_run(run_id, *, status='completed', output_path=None, error_message=None, row_count=None)

Mark a persisted scrape run as finished.
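
A round-trip sketch over the documented protocol methods. The SQLiteScrapeStorage constructor and the ScrapeRunContext fields are not shown on this page, so both constructions below are assumptions.

```python
import pandas as pd

from linkedin_web_scraper.application.storage import ScrapeRunContext
from linkedin_web_scraper.config.storage import build_sqlite_storage_url
from linkedin_web_scraper.infra.storage.sqlite import SQLiteScrapeStorage

df_jobs = pd.DataFrame({"JobID": ["1234567890"], "Title": ["Data Scientist"]})

storage = SQLiteScrapeStorage(build_sqlite_storage_url())  # assumption: takes a URL
run_id = storage.begin_run(ScrapeRunContext())             # assumption: default fields
storage.store_jobs(run_id, df_jobs)
storage.finish_run(run_id, status="completed", row_count=len(df_jobs))
df_back = storage.load_run_jobs(run_id)  # reload the persisted rows
```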