
Web Scraping with LLMs Using GPT-4o, Deno & Scout SDK

Learn how to automate web scraping with LLMs using Deno, Scout SDK, and GPT-4o. Scrape entire sitemaps, extract structured JSON, and process web content efficiently. Perfect for AI and data pipeline workflows!

Alex Boquist

If you're looking for a scalable, AI-powered way to extract structured data from websites, this guide will show you how to:

  • Scrape a full website sitemap using Deno
  • Process web pages in batches with Scout SDK
  • Use GPT-4o to extract structured JSON from raw web content
  • Store results in JSONL format for seamless integration

By combining Deno, Scout SDK, and GPT-4o, you can automate web scraping for LLM applications, data pipelines, and real-time analytics.


Why Use LLMs for Web Scraping?

Traditional scraping methods rely on CSS selectors and regex to extract content. But websites are increasingly JavaScript-heavy, dynamic, and inconsistent.

By integrating GPT-4o, you can:

  • Extract structured JSON without custom parsing rules
  • Summarize and process unstructured content dynamically
  • Adapt to different website structures
  • Scale up web scraping with batch processing

Scout SDK and GPT-4o make web scraping easier, smarter, and more scalable.

How This Web Scraping Workflow Works

This workflow will:

  1. Scrape a website’s sitemap to retrieve all URLs
  2. Send pages in batches to the Scout SDK API
  3. Use LLMs to extract and structure content into JSON
  4. Save results in JSONL format for further analysis

Step 1: Install Dependencies

Make sure you have Deno installed. If not, install it from the command line.

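On macOS and Linux, Deno's official install script handles setup (see the Deno docs for Windows):

```sh
# Install Deno via the official install script
curl -fsSL https://deno.land/install.sh | sh
```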

Deno natively supports TypeScript and ES modules, so you can import modules directly:

```typescript
import { ScoutClient } from "npm:scoutos";
import Sitemapper from "npm:sitemapper";
```

Required libraries:

  • Scout SDK – API for executing LLM-powered workflows
  • Sitemapper – Fetches a website’s sitemap for crawling

Step 2: Fetch the Sitemap

To efficiently scrape a site, start by fetching its sitemap.xml:

```typescript
// main.ts
async function fetchSites(sitemapUrl: string): Promise<string[]> {
  console.log(`Fetching sites from ${sitemapUrl}`);
  const sitemap = new Sitemapper();
  const { sites } = await sitemap.fetch(sitemapUrl);
  console.log(`Fetched ${sites.length} sites`);
  return sites;
}
```

Why use a sitemap?
A sitemap.xml file already lists every URL a site wants indexed, so you skip link discovery entirely, making this approach faster and more accurate than crawling page by page.
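As a quick sanity check, you can call fetchSites directly (this snippet uses the Scout docs sitemap that the full script scrapes later):

```typescript
// Fetch the sitemap and preview the first few URLs
const urls = await fetchSites("https://docs.scoutos.com/sitemap.xml");
console.log(urls.slice(0, 5));
```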

Step 3: Process URLs in Batches

Instead of processing pages one by one, we batch them in groups of 20 for efficiency.
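The batching itself is just array slicing; here is a small standalone sketch (a hypothetical chunk helper, not part of the tutorial's script) that shows the idea:

```typescript
// Hypothetical helper: split a URL list into fixed-size batches
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// chunk(sites, 20) yields arrays of at most 20 URLs each
```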

```typescript
// main.ts
async function processBatch(
  client: ScoutClient,
  batch: string[],
): Promise<void> {
  console.log(`Processing batch of ${batch.length} sites`);

  // Kick off all workflow runs in the batch concurrently
  const promises = batch.map(async (url) => {
    console.log(`Processing URL: ${url}`);
    const res = await client.workflows.run("wf_cm6k51et900020ds697g79atr", {
      inputs: { url: url },
    });

    // The workflow's JSON output lives in the run state; tag it with its URL
    const json = res.run.state.json.output;
    json.url = url;

    // Append one JSON object per line (JSONL)
    await Deno.writeTextFile("pages.jsonl", JSON.stringify(json) + "\n", {
      append: true,
    });
    console.log(`Processed and saved URL: ${url}`);
  });

  await Promise.all(promises);
  console.log(`Finished processing batch`);
}
```

Each URL is sent to a Scout workflow, which extracts structured JSON using GPT-4o. The processed output is appended to pages.jsonl, one JSON object per line, for easy storage.
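Because each line of pages.jsonl is an independent JSON object, loading the results back is straightforward (a small sketch, not part of the tutorial's script):

```typescript
// Read pages.jsonl back: one JSON object per non-empty line
const pages = (await Deno.readTextFile("pages.jsonl"))
  .split("\n")
  .filter((line) => line.trim().length > 0)
  .map((line) => JSON.parse(line));

console.log(`Loaded ${pages.length} pages`);
```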

Step 4: Run the Web Scraper

Now, tie everything together with the main function:

```typescript
// main.ts
async function main() {
  const sites = await fetchSites("https://docs.scoutos.com/sitemap.xml");

  // Deno reads environment variables via Deno.env (requires --allow-env)
  const client = new ScoutClient({
    apiKey: Deno.env.get("api_key"),
  });

  // Process the sitemap 20 URLs at a time
  for (let i = 0; i < sites.length; i += 20) {
    const batch = sites.slice(i, i + 20);
    await processBatch(client, batch);
  }

  console.log("All batches processed");
}

if (import.meta.main) {
  await main();
}
```

This script will:

  • Scrape all pages from the sitemap
  • Run LLM-powered extraction for each page
  • Save structured JSON output in pages.jsonl
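Deno is secure by default, so grant the script exactly the permissions it uses when you run it (the api_key value below is a placeholder):

```sh
# api_key matches the env var name read in main.ts
export api_key="YOUR_SCOUT_API_KEY"

# --allow-net: fetch the sitemap and call the Scout API
# --allow-env: read api_key
# --allow-write: append to pages.jsonl
deno run --allow-net --allow-env --allow-write main.ts
```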

Watch the Full Tutorial on YouTube

Want a step-by-step demo of how to implement this?

Previous Video: Scout Workflow Setup

Why Use Deno + Scout SDK for Web Scraping?

Deno is Secure & Fast

  • No node_modules required – Uses modern ES modules
  • Secure by default – No file/network access unless explicitly allowed
  • Built-in TypeScript support

Scout SDK Makes LLM-Powered Scraping Easy

  • Runs structured extraction workflows with GPT-4o
  • No need for complex parsing – AI handles it
  • Scalable & API-driven

Final Thoughts

By combining Deno, Scout SDK, and GPT-4o, you can scrape websites smarter and extract structured data at scale.

What would you use LLM-powered web scraping for? Drop a comment below or on YouTube.

Resources Mentioned

Scout SDK
Deno CLI
Sitemapper (npm package)
Previous Video (Scout Workflow Setup)


Ready to get started?

Sign up for free or chat live with a Scout engineer.
