Apply Web Scraping with Python practices (Ryan Mitchell). Covers First Scrapers (Ch 1: urllib, BeautifulSoup), HTML Parsing (Ch 2: find, findAll, CSS selectors, regex, lambda), Crawling (Ch 3-4: single-domain, cross-site, crawl models), Scrapy (Ch 5: spiders, items, pipelines, rules), Storing Data (Ch 6: CSV, MySQL, files, email), Reading Documents (Ch 7: PDF, Word, encoding), Cleaning Data (Ch 8: normalization, OpenRefine), NLP (Ch 9: n-grams, Markov, NLTK), Forms & Logins (Ch 10: POST, sessions, cookies), JavaScript (Ch 11: Selenium, headless, Ajax), APIs (Ch 12: REST, undocumented), Image/OCR (Ch 13: Pillow, Tesseract), Avoiding Traps (Ch 14: headers, honeypots), Testing (Ch 15: unittest, Selenium), Parallel (Ch 16: threads, processes), Remote (Ch 17: Tor, proxies), Legalities (Ch 18: robots.txt, CFAA, ethics). Trigger on "web scraping", "BeautifulSoup", "Scrapy", "crawler", "spider", "scraper", "parse HTML", "Selenium scraping", "data extraction".
npx skill4agent add zlstas/skills web-scraping-python

references/practices-catalog.md

| Concern | Chapters to Apply |
|---|---|
| Basic page fetching and parsing | Ch 1: urllib/requests, BeautifulSoup setup, first scraper |
| Finding elements in HTML | Ch 2: find/findAll, CSS selectors, navigating DOM trees, regex, lambda filters |
| Crawling within a site | Ch 3: Following links, building crawlers, breadth-first vs depth-first |
| Crawling across sites | Ch 4: Planning crawl models, handling different site layouts, normalizing data |
| Framework-based scraping | Ch 5: Scrapy spiders, items, pipelines, rules, CrawlSpider, logging |
| Saving scraped data | Ch 6: CSV, MySQL/database storage, downloading files, sending email |
| Non-HTML documents | Ch 7: PDF text extraction, Word docs, encoding handling |
| Data cleaning | Ch 8: String normalization, regex cleaning, OpenRefine, UTF-8 handling |
| Text analysis on scraped data | Ch 9: N-grams, Markov models, NLTK, summarization |
| Login-protected pages | Ch 10: POST requests, sessions, cookies, HTTP basic auth, handling tokens |
| JavaScript-rendered pages | Ch 11: Selenium WebDriver, headless browsers, waiting for Ajax, executing JS |
| Working with APIs | Ch 12: REST methods, JSON parsing, authentication, undocumented APIs |
| Images and OCR | Ch 13: Pillow image processing, Tesseract OCR, CAPTCHA handling |
| Avoiding detection | Ch 14: User-Agent headers, cookie handling, timing/delays, honeypot avoidance |
| Testing scrapers | Ch 15: unittest for scrapers, Selenium-based testing, handling site changes |
| Parallel scraping | Ch 16: Multithreading, multiprocessing, thread-safe queues |
| Remote/anonymous scraping | Ch 17: Tor, proxies, rotating IPs, cloud-based scraping |
| Legal and ethical concerns | Ch 18: robots.txt, Terms of Service, CFAA, copyright, ethical scraping |
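The first two rows above can be sketched in a few lines. This is an illustrative baseline, not code from the book: the URL, User-Agent string, and the `h2.title` selector are placeholder assumptions, and `beautifulsoup4` must be installed (`pip install beautifulsoup4`).

```python
# Minimal Ch 1-2 baseline: fetch with urllib, parse with BeautifulSoup.
# URL, User-Agent, and the h2.title selector are placeholders.
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def extract_titles(html: str) -> list[str]:
    """Pure parsing step: pull text from every <h2 class="title"> (Ch 2)."""
    soup = BeautifulSoup(html, "html.parser")
    # select() takes a CSS selector; adjust to the target page's markup
    return [h.get_text(strip=True) for h in soup.select("h2.title")]


def fetch_titles(url: str) -> list[str]:
    """Fetching step (Ch 1): identify the client and bound the wait."""
    req = Request(url, headers={"User-Agent": "Mozilla/5.0 (compatible; example-bot)"})
    with urlopen(req, timeout=10) as resp:
        html = resp.read().decode("utf-8", errors="replace")
    return extract_titles(html)
```

Keeping the parsing function separate from the network call means it can be unit-tested against saved HTML, which pays off later under Ch 15.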
User: "Scrape product listings from an e-commerce category page"
Apply: Ch 1 (fetching pages), Ch 2 (parsing product elements),
Ch 3 (pagination/crawling), Ch 6 (storing to CSV/DB)
Generate:
- requests + BeautifulSoup scraper
- CSS selector-based product extraction
- Pagination handler following next-page links
- CSV or database storage with schema
- Rate limiting and error handling

User: "Extract data from a React single-page application"
Apply: Ch 11 (Selenium, headless browser), Ch 2 (parsing rendered HTML),
Ch 14 (avoiding detection), Ch 15 (testing)
Generate:
- Selenium WebDriver with headless Chrome
- Explicit waits for dynamic content loading
- JavaScript execution for scrolling/interaction
- Data extraction from rendered DOM
- Headless browser configuration

User: "Scrape data from a site that requires login"
Apply: Ch 10 (forms, sessions, cookies), Ch 14 (headers, tokens),
Ch 6 (data storage)
Generate:
- Session-based login with CSRF token handling
- Cookie persistence across requests
- POST request for form submission
- Authenticated page navigation
- Session expiry detection and re-login

User: "Build a crawler to scrape thousands of pages from multiple domains"
Apply: Ch 5 (Scrapy framework), Ch 4 (crawl models),
Ch 16 (parallel scraping), Ch 14 (avoiding blocks)
Generate:
- Scrapy spider with item definitions and pipelines
- CrawlSpider with Rule and LinkExtractor
- Pipeline for database storage
- Settings for concurrent requests, delays, user agents
- Middleware for proxy rotation

references/review-checklist.md

## Summary
One paragraph: overall scraper quality, pattern adherence, main concerns.
## Fetching & Connection Issues
For each issue (Ch 1, 10-11):
- **Topic**: chapter and concept
- **Location**: where in the code
- **Problem**: what's wrong
- **Fix**: recommended change with code snippet
## Parsing & Extraction Issues
For each issue (Ch 2, 7):
- Same structure
## Crawling & Navigation Issues
For each issue (Ch 3-5):
- Same structure
## Storage & Data Issues
For each issue (Ch 6, 8):
- Same structure
## Resilience & Performance Issues
For each issue (Ch 14-16):
- Same structure
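For Ch 16 findings, the fix often amounts to replacing ad-hoc thread spawning with a worker pool over a thread-safe queue, with a per-worker delay for Ch 14 politeness. An illustrative sketch (`process()` is a stand-in for real fetch/parse logic):

```python
# Illustrative Ch 16 pattern: worker threads pulling URLs from a
# thread-safe queue, with a polite delay (Ch 14) between requests.
import queue
import threading
import time


def process(url: str) -> str:
    return f"processed {url}"  # placeholder for real fetch + parse


def crawl(urls: list[str], workers: int = 4, delay: float = 0.0) -> list[str]:
    tasks = queue.Queue()
    for u in urls:
        tasks.put(u)
    results: list[str] = []
    lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                url = tasks.get_nowait()
            except queue.Empty:
                return  # queue drained; worker exits
            out = process(url)
            with lock:  # keep shared-state access explicit
                results.append(out)
            tasks.task_done()
            time.sleep(delay)  # polite delay between requests (Ch 14)

    threads = [threading.Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results
```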
## Ethics & Legal Issues
For each issue (Ch 17-18):
- Same structure
## Testing & Quality Issues
For each issue (Ch 9, 15):
- Same structure
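A common Ch 15 recommendation is testing the parser against a saved HTML fixture instead of the live site. A minimal sketch (`extract_title()` is a hypothetical parser under test; the regex stands in for a real BeautifulSoup extractor):

```python
# Illustrative Ch 15 pattern: unit-testing a parser against an inline
# HTML fixture rather than a live page.
import re
import unittest


def extract_title(html: str) -> str:
    # hypothetical parser under test (regex stands in for BeautifulSoup)
    match = re.search(r"<h1>(.*?)</h1>", html, re.S)
    return match.group(1).strip() if match else ""


class TestParser(unittest.TestCase):
    FIXTURE = "<html><body><h1> Monty Python </h1></body></html>"

    def test_title_extracted_and_stripped(self):
        self.assertEqual(extract_title(self.FIXTURE), "Monty Python")

    def test_missing_title_returns_empty(self):
        self.assertEqual(extract_title("<p>no header</p>"), "")
```

Run with `python -m unittest <file>`; the same structure extends to Selenium-backed tests when the site renders client-side.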
## Recommendations
Priority-ordered from most critical to nice-to-have.
Each recommendation references the specific chapter/concept.