As an ultramarathon enthusiast, I often face a common challenge: how do I estimate my finish time for longer races I haven’t attempted yet? When discussing this with my coach, he suggested a practical approach: look at runners who’ve completed both a race I’ve done and the race I’m targeting. This relationship could provide valuable insight into possible finish times. But manually searching through race results would be incredibly time-consuming.
This led me to build Race Time Insights, a tool that automatically compares race results by finding athletes who’ve completed both events. The application scrapes race results from platforms like UltraSignup and Pacific Multisports, allowing runners to input two race URLs and see how other athletes performed across both events.
Building this tool showed me just how powerful DigitalOcean’s App Platform can be. Using Puppeteer with headless Chrome in Docker containers, I could focus on solving the problem for runners while App Platform handled all the infrastructure complexity. The result was a robust, scalable solution that helps the running community make data-driven decisions about their race goals.
After building Race Time Insights, I wanted to create a guide showing other developers how to leverage these same technologies: Puppeteer, Docker containers, and DigitalOcean App Platform. Of course, when working with external data, you need to be mindful of things like rate limiting and terms of service.
Enter Project Gutenberg. With its vast collection of public domain books and clear terms of service, it’s a perfect candidate for demonstrating these technologies. In this post, we’ll explore how to build a book search application using Puppeteer in a Docker container, deployed on App Platform, while following best practices for external data access.
Project Gutenberg Book Search
I’ve built and shared a web application that responsibly scrapes book information from Project Gutenberg. The app, which you can find in this GitHub repository, allows users to search through thousands of public domain books, view detailed information about each book, and access various download formats. What makes this particularly interesting is how it demonstrates responsible web scraping practices while providing genuine value to users.
Being a Good Digital Citizen
When building a web scraper, it’s important to follow good practices and respect both technical and legal boundaries. Project Gutenberg is an excellent example for learning these principles because:
- It has clear terms of service
- It provides robots.txt guidelines (see the sketch after this list)
- Its content is explicitly in the public domain
- It benefits from increased accessibility to its resources
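The sample app relies on Project Gutenberg’s permissive rules rather than parsing robots.txt itself, but a scraper can check those guidelines programmatically. Here’s a minimal sketch, assuming the robots-parser npm package and Node 18+ for the built-in fetch; none of this is in the sample repository:

// Hypothetical robots.txt check (not part of the sample app)
const robotsParser = require('robots-parser');

async function isAllowed(targetUrl, userAgent = 'GutenbergScraper') {
  const robotsUrl = new URL('/robots.txt', targetUrl).href;
  const response = await fetch(robotsUrl); // global fetch, Node 18+
  const robots = robotsParser(robotsUrl, await response.text());
  // Treat an explicit disallow as a hard stop
  return robots.isAllowed(targetUrl, userAgent) !== false;
}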
Our implementation includes several best practices:
Rate Limiting
For demonstration purposes, we implement a simple rate limiter that ensures at least one second between requests:
const rateLimiter = {
  lastRequest: 0,
  minDelay: 1000,
  async wait() {
    const now = Date.now();
    const timeToWait = Math.max(0, this.lastRequest + this.minDelay - now);
    if (timeToWait > 0) {
      await new Promise(resolve => setTimeout(resolve, timeToWait));
    }
    this.lastRequest = Date.now();
  }
};
This implementation is intentionally simplified for the example. It assumes a single application instance and stores state in memory, which wouldn’t be suitable for production use. More robust solutions might use Redis for distributed rate limiting or implement queue-based systems for better scalability.
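As a rough sketch of the Redis approach, a fixed-window limiter can be built on INCR and PEXPIRE. This assumes the ioredis client and a REDIS_URL environment variable; it is not part of the sample project:

// Hypothetical distributed rate limiter (not in the sample app)
const Redis = require('ioredis');
const redis = new Redis(process.env.REDIS_URL);

// Allow at most `limit` requests per `windowMs`, shared across instances
async function acquireSlot(key, limit = 1, windowMs = 1000) {
  const count = await redis.incr(key);
  if (count === 1) {
    await redis.pexpire(key, windowMs); // start the window on first hit
  }
  return count <= limit;
}

async function waitForSlot() {
  while (!(await acquireSlot('gutenberg:rate', 1, 1000))) {
    await new Promise(resolve => setTimeout(resolve, 100)); // back off briefly
  }
}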
This rate limiter is used before each request to Project Gutenberg:
async searchBooks(query, page = 1) {
  await this.initialize();
  await rateLimiter.wait();
  // ...
}

async getBookDetails(bookUrl) {
  await this.initialize();
  await rateLimiter.wait();
  // ...
}
Clear Bot Identification
A custom User-Agent helps website administrators understand who is accessing their site and why. This transparency allows them to:
- Contact you if there are issues
- Monitor and analyze bot traffic separately from human users
- Potentially provide better access or support for legitimate scrapers
await browserPage.setUserAgent('GutenbergScraper/1.0 (Educational Project)');
Efficient Resource Management
Chrome can be memory-intensive, particularly when running multiple instances. Properly closing browser pages after use prevents memory leaks and ensures your application runs efficiently, even when handling many requests:
try {
  // ... perform scraping work with browserPage ...
} finally {
  await browserPage.close();
}
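Beyond per-request page cleanup, it can also help to close the browser itself when the process shuts down. A minimal sketch, assuming a BookService instance named bookService like the one shown later in this post:

// Hypothetical shutdown hook (assumes a `bookService` instance exists)
process.on('SIGTERM', async () => {
  if (bookService.browser) {
    await bookService.browser.close(); // release Chrome's memory before exit
  }
  process.exit(0);
});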
By following these practices, we create a scraper that’s both effective and respectful of the resources it accesses. This is particularly important when working with valuable public resources like Project Gutenberg.
Web Scraping in the Cloud
The application leverages modern cloud architecture and containerization through DigitalOcean’s App Platform. This approach provides a good balance between development simplicity and production reliability.
The Power of App Platform
App Platform streamlines the deployment process by handling:
- Web server configuration
- SSL certificate management
- Security updates
- Load balancing
- Resource monitoring
This allows us to focus on the application code while App Platform manages the infrastructure.
Headless Chrome in a Container
The core of our scraping functionality uses Puppeteer, which provides a high-level API to control Chrome programmatically. Here’s how we set up and use Puppeteer in our application:
const puppeteer = require('puppeteer');

class BookService {
  constructor() {
    this.baseUrl = 'https://www.gutenberg.org';
    this.browser = null;
  }

  async initialize() {
    if (!this.browser) {
      console.log('Environment details:', {
        PUPPETEER_EXECUTABLE_PATH: process.env.PUPPETEER_EXECUTABLE_PATH,
        CHROME_PATH: process.env.CHROME_PATH,
        NODE_ENV: process.env.NODE_ENV
      });

      const options = {
        headless: 'new',
        args: [
          '--no-sandbox',
          '--disable-setuid-sandbox',
          '--disable-dev-shm-usage',
          '--disable-gpu',
          '--disable-extensions',
          '--disable-software-rasterizer',
          '--window-size=1280,800',
          '--user-agent=GutenbergScraper/1.0 (+https://github.com/wadewegner/doappplat-puppeteer-sample) Chromium/120.0.0.0'
        ],
        executablePath: process.env.PUPPETEER_EXECUTABLE_PATH || '/usr/bin/chromium-browser',
        defaultViewport: { width: 1280, height: 800 }
      };

      this.browser = await puppeteer.launch(options);
    }
  }

  async searchBooks(query, page = 1) {
    await this.initialize();
    await rateLimiter.wait();

    const browserPage = await this.browser.newPage();
    try {
      await browserPage.setExtraHTTPHeaders({
        'Accept-Language': 'en-US,en;q=0.9',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Connection': 'keep-alive',
        'Upgrade-Insecure-Requests': '1',
        'X-Bot-Info': 'GutenbergScraper - A tool for searching Project Gutenberg'
      });

      const searchUrl = `${this.baseUrl}/ebooks/search/?query=${encodeURIComponent(query)}&start_index=${(page - 1) * 24}`;
      await browserPage.goto(searchUrl, { waitUntil: 'networkidle0' });
      // ... extract and return the search results ...
    } finally {
      await browserPage.close();
    }
  }
}
This setup allows us to:
- Run Chrome in headless mode (no GUI needed)
- Execute JavaScript in the context of web pages
- Safely manage browser resources
- Work reliably in a containerized environment
The setup also includes several important configurations for running in a containerized environment:
- Proper Chrome Arguments: Essential flags like --no-sandbox and --disable-dev-shm-usage for running in containers
- Environment-aware Path: Uses the correct Chrome binary path from environment variables
- Resource Management: Sets the viewport size and disables unnecessary features
- Professional Bot Identity: A clear user agent and HTTP headers identifying our scraper
- Error Handling: Proper cleanup of browser pages to prevent memory leaks
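The searchBooks method above elides the actual extraction step. That part typically uses page.evaluate to run JavaScript inside the page and pull out structured data. Here’s a minimal sketch; the selectors are illustrative assumptions, not necessarily the ones used in the repository:

// Hypothetical extraction step (selectors are assumptions)
const books = await browserPage.evaluate(() => {
  return Array.from(document.querySelectorAll('li.booklink')).map(li => ({
    title: li.querySelector('span.title')?.textContent.trim() ?? '',
    author: li.querySelector('span.subtitle')?.textContent.trim() ?? '',
    link: li.querySelector('a')?.href ?? ''
  }));
});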
While Puppeteer makes it easy to control Chrome programmatically, running it in a container requires the right system dependencies and configuration. Let’s look at how we set this up in our Docker environment.
Docker: Ensuring Consistent Environments
One of the biggest challenges in deploying web scrapers is ensuring they work the same way in development and production. Your scraper might work perfectly on your local machine but fail in the cloud due to missing dependencies or different system configurations. Docker solves this by packaging everything the application needs, from Node.js to Chrome itself, into a single container that runs identically everywhere.
Our Dockerfile sets up this consistent environment:
FROM node:18-alpine

# Install Chromium and dependencies
RUN apk add --no-cache \
    chromium \
    nss \
    freetype \
    harfbuzz \
    ca-certificates \
    ttf-freefont \
    dumb-init

# Set environment variables
ENV PUPPETEER_SKIP_CHROMIUM_DOWNLOAD=true \
    PUPPETEER_EXECUTABLE_PATH=/usr/bin/chromium-browser \
    PUPPETEER_DISABLE_DEV_SHM_USAGE=true
The Alpine-based image keeps our container lightweight while including all the necessary dependencies. When you run this container, whether on your laptop or in DigitalOcean’s App Platform, you get the exact same environment with all the correct versions and configurations for running headless Chrome.
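The snippet above only shows the Chromium setup; the rest of the Dockerfile follows the usual Node.js pattern. A sketch of what the remaining steps might look like (the entry point and port here are assumptions; check the repository for the real file):

# Install Node dependencies first so Docker can cache this layer
WORKDIR /usr/src/app
COPY package*.json ./
RUN npm ci --omit=dev

# Copy the application source
COPY . .

# dumb-init forwards signals so Chrome child processes exit cleanly
EXPOSE 8080
ENTRYPOINT ["dumb-init", "--"]
CMD ["node", "server.js"]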
Development to Deployment
Let’s walk through getting this project up and running:
1. Local Development
First, fork the example repository to your GitHub account. This gives you your own copy to work with and deploy from. Then clone your fork locally:
# Clone your fork
git clone https://github.com/YOUR-USERNAME/doappplat-puppeteer-sample.git
cd doappplat-puppeteer-sample

# Build and run with Docker
docker build -t gutenberg-scraper .
docker run -p 8080:8080 gutenberg-scraper
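Once the container is running, the app should be reachable at http://localhost:8080 in your browser, assuming the server listens on port 8080 as the port mapping above suggests.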
2. Understanding the Code
The application is structured around three main components:
- Book Service: Handles web scraping and data extraction

  async searchBooks(query, page = 1) {
    await this.initialize();
    await rateLimiter.wait();

    const itemsPerPage = 24;
    const searchUrl = `${this.baseUrl}/ebooks/search/?query=${encodeURIComponent(query)}&start_index=${(page - 1) * itemsPerPage}`;
    // ...
  }
- Express Server: Manages routes and renders templates

  app.get('/book/:url(*)', async (req, res) => {
    try {
      const bookUrl = req.params.url;
      const bookDetails = await bookService.getBookDetails(bookUrl);
      res.render('book', { book: bookDetails, error: null });
    } catch (error) {
      // ... render an error view ...
    }
  });
- Frontend Views: A clean, responsive UI using Bootstrap

  <div class="card book-card h-100">
    <div class="card-body">
      <span class="badge bg-secondary downloads-badge">
        <%= book.downloads.toLocaleString() %> downloads
      </span>
      <h5 class="card-title"><%= book.title %></h5>
      <!-- ... more UI elements ... -->
    </div>
  </div>
3. Deployment to DigitalOcean
Now that you have your own fork of the repository, deploying to DigitalOcean App Platform is straightforward:
- Create a new App Platform application
- Connect it to your forked repo
- Under resources, delete the second resource (the one that isn’t a Dockerfile); it is auto-generated by App Platform and not needed
- Deploy by clicking Create Resources
The application will be built and deployed automatically, with App Platform handling all the infrastructure details.
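If you prefer to configure the app declaratively, App Platform also accepts an app spec. A minimal sketch of what one might look like for this project (the name, branch, and instance size are assumptions, not taken from the repository):

name: gutenberg-scraper
services:
  - name: web
    dockerfile_path: Dockerfile
    github:
      repo: YOUR-USERNAME/doappplat-puppeteer-sample
      branch: main
      deploy_on_push: true
    http_port: 8080
    instance_count: 1
    instance_size_slug: basic-xxs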
Conclusion
This Project Gutenberg scraper demonstrates how to build a practical web application using modern cloud technologies. By combining Puppeteer for web scraping, Docker for containerization, and DigitalOcean’s App Platform for deployment, we’ve created a solution that’s both robust and easy to maintain.
The project serves as a template for your own web scraping applications, showing how to handle browser automation, manage resources efficiently, and deploy to the cloud. Whether you’re building a data collection tool or just learning about containerized applications, this example provides a solid foundation to build upon.
Check out the project on GitHub to learn more and deploy your own instance!