Building a Self-Healing CSS Selector Repair System
This article describes a Python sidecar that automatically fixes broken CSS selectors in a web scraper by using a local language model to propose new selector candidates and validate them against the live HTML.
Why it matters
This system reduces the operational overhead of maintaining fragile web scrapers by automating selector repair, improving reliability and cutting down on manual intervention.
Key Points
- Scraper failures due to fragile CSS selectors are a recurring operational problem
- The sidecar consumes repair jobs from a Redis queue, published when the scraper fails
- A local language model proposes new selector candidates, which are then tested against the live HTML
- Validated selectors are written to the database automatically; no redeploy is required
- Design principles include treating the LLM as a proposer, not a decider, and preferring escalation over hallucination
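The failure-to-repair handoff in the second point can be sketched as a small publish helper. This is a minimal illustration, not the author's actual code: the job schema (`field`, `url`, `old_selector`) and queue name are assumptions, and the client is any object exposing a redis-py-style `lpush`.

```python
import json

def publish_repair_job(queue_client, queue_name: str, field: str,
                       url: str, old_selector: str) -> str:
    """Serialize a selector-repair job and push it onto a Redis list.

    Hypothetical job schema: which field failed, on which page, and the
    selector that stopped matching. Returns the serialized payload.
    """
    job = {
        "field": field,                # name of the field that failed to extract
        "url": url,                    # page where extraction failed
        "old_selector": old_selector,  # selector that no longer matches
    }
    payload = json.dumps(job)
    # With redis-py this is client.lpush(queue_name, payload); the sidecar
    # would consume jobs with a blocking BRPOP on the same key.
    queue_client.lpush(queue_name, payload)
    return payload
```

Using a queue decouples the scraper from the repair path: the scraper fails fast and moves on, while the sidecar works through repairs asynchronously.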
Details
The article discusses the problem of fragile CSS selectors in production web scrapers, where changes to a third-party website's DOM can silently break extraction. The author presents a Python sidecar that repairs these breakages automatically. When the scraper fails to extract a field, it publishes a repair job to a Redis queue. The sidecar picks up the job, fetches the current HTML, and prompts a local language model running on-device via MLX. The LLM proposes new CSS/XPath selector candidates, each with a confidence score and reasoning. Every candidate is then tested against the live HTML using BeautifulSoup and lxml, and the extracted value is validated against a type schema. If a candidate passes, the new selector is written directly to the database, so the next scraper run picks up the updated configuration without a redeploy. The design treats the LLM strictly as a proposer, never a decider, and escalates to a human for any case it cannot resolve automatically.