API Reference
Application
linkedin_web_scraper.application.linkedin_job_scraper
High-level orchestration for the LinkedIn scraping pipeline.
LinkedInJobScraper
Coordinate scraping, cleaning, classification, and optional enrichment.
The class owns the end-to-end library workflow for a single scrape request: scrape search results, normalize the dataframe, optionally filter titles, fetch detail pages, and optionally enrich descriptions through OpenAI.
__init__(logger, config, *, job_scraper=None, job_data_cleaner=None, openai_handler=None)
Build a scraper pipeline with optional dependency overrides.
run()
Run the end-to-end scrape pipeline and return the resulting dataframe.
scrape_jobs()
Scrape jobs from LinkedIn using JobScraper.
clean_jobs(scraped_jobs)
Clean the scraped job data using JobDataCleaner.
classify_jobs(cleaned_jobs)
Classify job titles using JobTitleClassifier.
fetch_job_details(classified_jobs)
Fetch job-detail pages for the classified job dataframe.
clean_job_details(jobs_with_details)
Clean the extracted job details before optional enrichment.
enrich_jobs_with_descriptions(cleaned_jobs_with_details)
Enrich job data by processing job descriptions with OpenAI.
final_processing(enriched_jobs)
Perform final processing on the enriched job data.
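A minimal usage sketch based on the documented `__init__(logger, config, ...)` signature; it assumes `run()` returns a pandas dataframe, as the pipeline description implies:

```python
from linkedin_web_scraper.application.linkedin_job_scraper import LinkedInJobScraper
from linkedin_web_scraper.config.job_scraper_config_factory import JobScraperConfigFactory
from linkedin_web_scraper.infra.logging import get_logger

# Build a normalized config, then run the whole pipeline in one call:
# scrape -> clean -> classify -> details -> optional enrichment.
config = JobScraperConfigFactory.create("Data Scientist", "Monterrey")
scraper = LinkedInJobScraper(logger=get_logger(__name__), config=config)
df_jobs = scraper.run()
```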
linkedin_web_scraper.application.daily_scrape_service
DailyScrapeService
Coordinate city-level daily scrapes, persistence, and CSV exports.
run_for_location(*, position='Data Scientist', location='Monterrey', openai_enabled=False, openai_model=DEFAULT_OPENAI_MODEL, time_posted=TimePosted.DAY, remote_types=DEFAULT_REMOTE_TYPES, file_name=None, output_dir=None, append=True)
Run the configured scrape for one location across remote variants.
run_daily(*, cities=DEFAULT_DAILY_CITIES, position='Data Scientist', openai_enabled=False, openai_model=DEFAULT_OPENAI_MODEL, time_posted=TimePosted.DAY, output_dir=None, combined_file_name=None)
Run the default daily scrape across multiple cities and save a combined CSV.
format_jobs_output_name(position, location)
Build the stable CSV name used for daily city outputs.
resolve_output_path(file_name, output_dir=None)
Resolve an output file name under the managed jobs directory by default.
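A sketch of the two entrypoints. The constructor for DailyScrapeService is not documented here, so the no-argument form below is an assumption:

```python
from linkedin_web_scraper.application.daily_scrape_service import (
    DailyScrapeService,
    format_jobs_output_name,
)

service = DailyScrapeService()  # assumed no-argument constructor

# One location across its remote variants; appends to the stable per-city CSV.
service.run_for_location(position="Data Scientist", location="Monterrey")

# The default multi-city daily run, saving a combined CSV.
service.run_daily(combined_file_name="daily_combined.csv")

print(format_jobs_output_name("Data Scientist", "Monterrey"))  # stable per-city name
```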
linkedin_web_scraper.application.storage
Application-layer storage contracts for persisted scrape runs.
ScrapeRunContext
dataclass
Describe one persisted scrape run at the application boundary.
ScrapeStorage
Bases: Protocol
Protocol for run-scoped scrape persistence implementations.
begin_run(context)
Create and return a persisted run identifier.
store_jobs(run_id, df_jobs)
Persist the dataframe for a previously created scrape run.
load_run_jobs(run_id)
Load the persisted dataframe for one scrape run.
finish_run(run_id, *, status='completed', output_path=None, error_message=None, row_count=None)
Mark the scrape run as finished and optionally attach result metadata.
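Because ScrapeStorage is a Protocol, any object with structurally matching methods satisfies it. A hypothetical in-memory implementation, assuming integer run identifiers (the identifier type is not documented above):

```python
import pandas as pd

from linkedin_web_scraper.application.storage import ScrapeRunContext


class InMemoryScrapeStorage:
    """Illustrative in-memory stand-in satisfying the ScrapeStorage protocol."""

    def __init__(self) -> None:
        self._runs: dict[int, pd.DataFrame] = {}
        self._next_id = 1

    def begin_run(self, context: ScrapeRunContext) -> int:
        run_id = self._next_id
        self._next_id += 1
        self._runs[run_id] = pd.DataFrame()
        return run_id

    def store_jobs(self, run_id: int, df_jobs: pd.DataFrame) -> None:
        self._runs[run_id] = df_jobs

    def load_run_jobs(self, run_id: int) -> pd.DataFrame:
        return self._runs[run_id]

    def finish_run(self, run_id: int, *, status="completed", output_path=None,
                   error_message=None, row_count=None) -> None:
        pass  # nothing extra to persist in memory
```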
linkedin_web_scraper.application.runtime_runner
Application-level runtime helpers for CLI and scheduled executions.
RuntimeRunner
Execute runtime-configured scrape and export workflows.
build_storage_url(runtime_config)
Resolve the SQLite storage URL from runtime settings.
describe_once(runtime_config)
Describe the resolved single-location scrape plan.
describe_daily(runtime_config)
Describe the resolved daily multi-city scrape plan.
describe_export(runtime_config)
Describe the resolved CSV export plan for a persisted scrape run.
run_once(runtime_config)
Run one configured location scrape through the daily service.
run_daily(runtime_config)
Run the configured daily multi-city scrape workflow.
export_run(runtime_config)
Export a persisted scrape run to a CSV file.
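A sketch of driving the runner from a script. RuntimeRunner's constructor is undocumented above, so the no-argument form is an assumption:

```python
from linkedin_web_scraper.application.runtime_runner import RuntimeRunner
from linkedin_web_scraper.config.runtime import load_runtime_config

runtime_config = load_runtime_config()       # path resolved from the environment
runner = RuntimeRunner()                     # assumed no-argument constructor
print(runner.describe_once(runtime_config))  # inspect the plan before running
runner.run_once(runtime_config)
```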
Configuration
linkedin_web_scraper.config.job_scraper_config
JobScraperConfig
dataclass
Typed runtime configuration for a single LinkedIn scrape.
linkedin_web_scraper.config.job_scraper_advanced_config
JobScraperAdvancedConfig
dataclass
Optional advanced configuration overrides for scraping and enrichment.
from_collections(*, location_mapping=None, keywords=None, skills_categories=None)
classmethod
Build a config from generic mapping and sequence inputs.
linkedin_web_scraper.config.job_scraper_config_factory
JobScraperConfigFactory
Factory helpers for constructing normalized scraper config objects.
create(position, location, openai_enabled=False, openai_model=DEFAULT_OPENAI_MODEL, time_posted=TimePosted.DAY, remote=RemoteType.ALL, *, distance=10, advanced_config=None)
staticmethod
Build a normalized scraper configuration from user-facing inputs.
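Usage follows directly from the documented signature; note that distance is keyword-only:

```python
from linkedin_web_scraper.config.job_scraper_config_factory import JobScraperConfigFactory
from linkedin_web_scraper.config.options import RemoteType, TimePosted

config = JobScraperConfigFactory.create(
    "Data Scientist",
    "Monterrey",
    time_posted=TimePosted.DAY,
    remote=RemoteType.ALL,
    distance=25,
)
```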
linkedin_web_scraper.config.openai
OpenAI-related runtime defaults for optional enrichment.
linkedin_web_scraper.config.storage
Storage-related runtime defaults and helpers.
build_sqlite_storage_url(file_name=DEFAULT_SQLITE_DB_FILE, state_dir=None)
Build a SQLite URL rooted in the managed state directory by default.
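For example (the exact URL format is not documented here, and the state_dir argument type is assumed to accept a path string):

```python
from linkedin_web_scraper.config.storage import build_sqlite_storage_url

url = build_sqlite_storage_url()  # default DB file in the managed state dir
custom = build_sqlite_storage_url("scrapes.db", state_dir="/tmp/scraper-state")
```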
linkedin_web_scraper.config.runtime
Runtime configuration models and TOML loading helpers for CLI workflows.
LoggingRuntimeConfig
dataclass
Runtime logging settings for CLI and script entrypoints.
StorageRuntimeConfig
dataclass
Runtime storage settings for persisted scrape runs.
ScrapeOnceRuntimeConfig
dataclass
Runtime defaults for a single-location scrape command.
ScrapeDailyRuntimeConfig
dataclass
Runtime defaults for the multi-city daily scrape command.
ExportRuntimeConfig
dataclass
Runtime defaults for exporting persisted run data to CSV.
RuntimeConfig
dataclass
Top-level runtime configuration for CLI and scheduled runs.
runtime_config_from_mapping(data)
Build a runtime config object from TOML-compatible nested mappings.
apply_environment_overrides(config, environ=None)
Apply lightweight environment overrides on top of a loaded runtime config.
resolve_runtime_config_path(path=None, *, environ=None)
Resolve the runtime config path from an explicit value or environment.
load_runtime_config(path=None, *, environ=None)
Load runtime config from TOML and apply supported environment overrides.
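A sketch of the loading helpers. The nested section and key names below merely mirror the dataclass names; they are assumptions, not the real TOML schema:

```python
from linkedin_web_scraper.config.runtime import (
    apply_environment_overrides,
    load_runtime_config,
    runtime_config_from_mapping,
)

# From a TOML file (path from the argument, or resolved from the environment).
config = load_runtime_config("runtime.toml")

# Or from an in-memory mapping; key names here are illustrative only.
config = runtime_config_from_mapping({
    "logging": {"level": "INFO"},
    "scrape_once": {"position": "Data Scientist", "location": "Monterrey"},
})
config = apply_environment_overrides(config)
```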
linkedin_web_scraper.config.options
TimePosted
Bases: StrEnum
Supported LinkedIn time-posted filter values.
RemoteType
Bases: StrEnum
Supported LinkedIn remote-work filter values.
Domain
linkedin_web_scraper.domain.job_data_cleaner
Dataframe normalization helpers for raw and enriched LinkedIn job data.
JobDataCleaner
Clean and normalize raw LinkedIn job dataframes.
clean_jobs_dataframe(df, location_mapping)
Clean raw scrape output into a normalized jobs dataframe.
process_location_data(df, location_mapping)
Clean the Location column and apply location-specific transformations.
process_urls_and_job_ids(df)
Truncate job URLs and extract JobIDs from them.
filter_valid_job_ids(df)
Remove rows whose JobIDs are not exactly 10 digits.
remove_duplicate_job_ids(df)
Find and remove duplicate JobIDs.
remove_duplicates_by_columns(df)
Remove duplicates based on Location, Title, and Company.
clean_extracted_job_data(df_jobs)
Clean and normalize extracted job details.
clean_num_applicants(df_jobs)
Clean and standardize the number of applicants.
clean_seniority_level(df_jobs)
Clean up the 'SeniorityLevel' column.
standardize_employment_type(df_jobs)
Standardize 'EmploymentType' as a categorical variable.
standardize_job_function(df_jobs)
Standardize the 'JobFunction' column.
split_job_functions(df_jobs)
Split 'JobFunction' into three separate columns.
convert_posted_time(df_jobs)
Convert 'PostedTime' into dates.
reorder_columns(df_jobs)
Reorder the dataframe columns into a consistent presentation order.
process_enriched_job_data(df_jobs, tech_stack_categories=None)
Post-process enriched job data after description enrichment.
extract_min_years(df_jobs)
Extract the minimum number of years from the experience string.
categorize_studies(df_jobs)
Categorize the minimum level of studies.
categorize_tech_stack(df_jobs, tech_stack_categories)
Categorize tech stack values into predefined groups.
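A sketch of the raw-cleaning entrypoint. The toy column names are inferred from the method docs above (Location/Title/Company, plus a URL the JobID is parsed from) and may not match the real schema exactly; the no-argument constructor is also an assumption:

```python
import pandas as pd

from linkedin_web_scraper.domain.job_data_cleaner import JobDataCleaner

df_raw = pd.DataFrame({
    "Title": ["Data Scientist"],
    "Company": ["Acme"],
    "Location": ["Monterrey, Nuevo León, Mexico"],
    "URL": ["https://www.linkedin.com/jobs/view/1234567890/?refId=abc"],
})
cleaner = JobDataCleaner()  # assumed no-argument constructor
df_clean = cleaner.clean_jobs_dataframe(
    df_raw, {"Monterrey, Nuevo León, Mexico": "Monterrey"}
)
```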
linkedin_web_scraper.domain.job_title_classifier
Title-filtering helpers for narrowing results to relevant jobs.
JobTitleClassifier
Filter scraped jobs to titles related to the requested position.
classify_title(df_jobs)
Classify job titles based on keywords and filter out unrelated jobs.
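A sketch, assuming a no-argument constructor (the keyword source is undocumented here; in the real pipeline it presumably comes from the scraper config):

```python
from linkedin_web_scraper.domain.job_title_classifier import JobTitleClassifier

classifier = JobTitleClassifier()                  # keyword config assumed internal
df_relevant = classifier.classify_title(df_clean)  # df_clean from the cleaner example
```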
Infrastructure
linkedin_web_scraper.infra.logging
SupportsLogAttribute
Bases: Protocol
Protocol for legacy logger wrappers that expose a .log logger.
Logger
Backward-compatible logger facade that configures package logging once.
parse_log_level(level)
Normalize string and integer log levels to logging constants.
get_logger(name=None)
Return a package logger or a named descendant logger.
resolve_logger(logger=None, *, name=None)
Resolve stdlib loggers and legacy wrappers to a standard logger instance.
configure_logging(filename=None, *, level=logging.INFO, stream=None, logger_name=PACKAGE_LOGGER_NAME, force=True, format_string=DEFAULT_LOG_FORMAT)
Configure package logging for scripts and CLI entrypoints.
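A typical entrypoint setup using the documented helpers:

```python
from linkedin_web_scraper.infra.logging import (
    configure_logging,
    get_logger,
    parse_log_level,
)

configure_logging(level=parse_log_level("DEBUG"))  # one-time setup in an entrypoint
logger = get_logger(__name__)                      # descendant of the package logger
logger.info("scrape starting")
```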
linkedin_web_scraper.infra.paths
resolve_jobs_output_path(file_name, output_dir=None)
Resolve a jobs CSV path under the managed jobs artifact directory by default.
resolve_log_path(file_name, log_dir=None)
Resolve a log file path under the managed logs artifact directory by default.
resolve_state_path(file_name, state_dir=None)
Resolve a state or database path under the managed state directory by default.
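Each helper falls back to its managed artifact directory when no explicit directory is given; the file names here are illustrative:

```python
from linkedin_web_scraper.infra.paths import (
    resolve_jobs_output_path,
    resolve_log_path,
    resolve_state_path,
)

csv_path = resolve_jobs_output_path("jobs_data_scientist_monterrey.csv")
log_path = resolve_log_path("scrape.log")
db_path = resolve_state_path("scrapes.db")
```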
HTTP
linkedin_web_scraper.infra.http.policy
HttpRequestPolicy
dataclass
Retry, timeout, and header policy for HTTP scraping requests.
with_overrides(*, timeout=None, max_retries=None, initial_backoff=None, max_backoff=None, retryable_status_codes=None)
Return a copy of the policy with selected fields replaced.
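A sketch of deriving a variant policy; it assumes the dataclass provides defaults for every field, which is not documented above:

```python
from linkedin_web_scraper.infra.http.policy import HttpRequestPolicy

base = HttpRequestPolicy()  # assumed default-constructible
patient = base.with_overrides(timeout=30, max_retries=5)
```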
linkedin_web_scraper.infra.http.utils
Low-level HTTP request helpers used by the scraper layer.
get_random_header(headers=None)
Return a random user-agent header from the configured list.
fetch_until_success(url, logger=None, max_retries=None, backoff_time=None, *, session=None, timeout=None, policy=None, sleep=None)
Attempt to fetch a URL until success or the retry budget is exhausted.
Returns the response on HTTP 200. Returns None for non-retryable
responses, terminal request failures, or when the retry budget is exhausted.
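For example, presumably returning a requests-style response object on success:

```python
from linkedin_web_scraper.infra.http.utils import fetch_until_success

response = fetch_until_success(
    "https://www.linkedin.com/jobs/search?keywords=data+scientist",
    max_retries=3,
    timeout=10,
)
if response is None:
    print("non-retryable response or retry budget exhausted")
```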
linkedin_web_scraper.infra.http.job_scraper
HTTP-backed LinkedIn result-page and detail-page scraping helpers.
JobScraper
HTTP-facing scraper for LinkedIn job result pages and detail pages.
__init__(config, logger=None, *, session=None, request_timeout=None, request_policy=None)
Initialize the scraper with config, logging, and HTTP dependencies.
scrape_jobs()
Scrape jobs from LinkedIn across multiple pages.
fetch_total_jobs()
Fetch and return the total number of jobs available for the search criteria.
generate_main_url()
Generate the main LinkedIn job search URL with the specified filters.
generate_paginated_url(start)
Generate the paginated URL for fetching jobs from LinkedIn.
parse_job_data(html_content)
Parse the job data from the HTML content and add it to the jobs list.
extract_job_info(job)
Extract job information from a single job listing.
fetch_job_details(df_jobs)
Fetch detailed job information for each job posting.
get_jobid_information(jobid)
Generate the URL to fetch detailed job posting data based on job ID.
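A sketch of the result-page and detail-page flow, using the documented `__init__(config, logger=None, ...)` signature:

```python
from linkedin_web_scraper.config.job_scraper_config_factory import JobScraperConfigFactory
from linkedin_web_scraper.infra.http.job_scraper import JobScraper

config = JobScraperConfigFactory.create("Data Scientist", "Monterrey")
scraper = JobScraper(config, request_timeout=15)
print(scraper.generate_main_url())            # inspect the search URL first
df_jobs = scraper.scrape_jobs()               # paginated result pages
df_jobs = scraper.fetch_job_details(df_jobs)  # one detail page per posting
```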
OpenAI
linkedin_web_scraper.infra.openai.models
Typed models and protocols for optional OpenAI enrichment.
OpenAIEnrichmentConfig
dataclass
Runtime configuration for OpenAI-backed enrichment.
JobDescriptionEnrichment
dataclass
Structured job-description enrichment output.
tech_stack_text
property
Return the tech stack as a comma-separated string for dataframe storage.
english_requirement_text
property
Return a dataframe-friendly English requirement value.
to_legacy_dict()
Convert the structured result into the legacy JSON-compatible shape.
JobDescriptionEnricher
Bases: Protocol
Protocol for adapters that can enrich a raw job description.
extract_job_description(description)
Return a structured enrichment result for a single job description.
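Any object with a structurally matching extract_job_description method satisfies the protocol. A hypothetical stub showing only the shape (JobDescriptionEnrichment's fields are not listed above):

```python
from linkedin_web_scraper.infra.openai.models import JobDescriptionEnrichment


class OfflineEnricher:
    """Hypothetical non-OpenAI enricher satisfying JobDescriptionEnricher."""

    def extract_job_description(self, description: str) -> JobDescriptionEnrichment:
        # A real implementation would populate the enrichment fields
        # from `description`; they are undocumented here.
        raise NotImplementedError
```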
linkedin_web_scraper.infra.openai.openai_handler
Optional OpenAI adapter built on the Responses API.
OpenAIConfigurationError
Bases: RuntimeError
Raised when OpenAI enrichment is requested without valid configuration.
OpenAIDependencyError
Bases: RuntimeError
Raised when optional OpenAI enrichment dependencies are missing.
OpenAIHandler
Handle optional OpenAI job-description enrichment using structured parsing.
create_messages(description)
Create a compatibility prompt payload for a raw job description.
extract_job_description(description)
Return structured enrichment data for a single job description.
generate_chat_completion(messages)
Compatibility wrapper that returns the historical JSON-compatible shape.
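A guarded-usage sketch. The constructor arguments are undocumented above; an enrichment config and API key are presumably required, so the no-argument call is illustrative only:

```python
from linkedin_web_scraper.infra.openai.openai_handler import (
    OpenAIConfigurationError,
    OpenAIDependencyError,
    OpenAIHandler,
)

try:
    handler = OpenAIHandler()  # real construction likely needs config/credentials
    result = handler.extract_job_description("3+ years of Python, SQL required...")
except (OpenAIConfigurationError, OpenAIDependencyError) as exc:
    print(f"enrichment unavailable: {exc}")
```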
linkedin_web_scraper.infra.openai.job_description_processor
Helpers for applying structured OpenAI enrichment to job dataframes.
JobDescriptionProcessor
Use a structured enricher to enrich scraped job descriptions.
process_job_descriptions(df_jobs)
Process job descriptions and append parsed fields without failing the scrape.
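A sketch, assuming the processor is constructed around any JobDescriptionEnricher (`handler` and `df_jobs` come from the examples above):

```python
from linkedin_web_scraper.infra.openai.job_description_processor import (
    JobDescriptionProcessor,
)

processor = JobDescriptionProcessor(handler)  # constructor shape assumed
df_enriched = processor.process_job_descriptions(df_jobs)  # row failures don't abort
```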
Storage
linkedin_web_scraper.infra.storage.models
SQLAlchemy models for persisted scrape runs and job snapshots.
Base
Bases: DeclarativeBase
Base declarative model for the SQLite scrape storage schema.
ScrapeRunRecord
JobRecord
JobSnapshotRecord
JobEnrichmentRecord
utcnow()
Return a timezone-aware UTC timestamp for persisted records.
linkedin_web_scraper.infra.storage.sqlite
SQLite-backed storage adapter for persisted scrape runs.
SQLiteScrapeStorage
Bases: ScrapeStorage
Persist scrape runs, job snapshots, and enrichments to SQLite.
begin_run(context)
Create a persisted scrape run and return its run identifier.
store_jobs(run_id, df_jobs)
Persist the dataframe rows for a scrape run.
load_run_jobs(run_id)
Load the persisted dataframe rows for one scrape run.
finish_run(run_id, *, status='completed', output_path=None, error_message=None, row_count=None)
Mark a persisted scrape run as finished.
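A run-lifecycle sketch. Both the storage constructor signature and ScrapeRunContext's fields are undocumented above, so those two calls are illustrative:

```python
import pandas as pd

from linkedin_web_scraper.application.storage import ScrapeRunContext
from linkedin_web_scraper.config.storage import build_sqlite_storage_url
from linkedin_web_scraper.infra.storage.sqlite import SQLiteScrapeStorage

storage = SQLiteScrapeStorage(build_sqlite_storage_url())  # assumed URL argument
context = ScrapeRunContext()                               # real fields required here

df_jobs = pd.DataFrame({"Title": ["Data Scientist"], "Company": ["Acme"]})
run_id = storage.begin_run(context)
storage.store_jobs(run_id, df_jobs)
print(len(storage.load_run_jobs(run_id)))
storage.finish_run(run_id, status="completed", row_count=len(df_jobs))
```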