
Web Scraping with LLMs Using GPT-4o, Deno & Scout SDK

Learn how to automate web scraping with LLMs using Deno, Scout SDK, and GPT-4o. Scrape entire sitemaps, extract structured JSON, and process web content efficiently. Perfect for AI and data pipeline workflows!

Alex Boquist

If you're looking for a scalable, AI-powered way to extract structured data from websites, this guide will show you how to:

  • Scrape a full website sitemap using Deno
  • Process web pages in batches with Scout SDK
  • Use GPT-4o to extract structured JSON from raw web content
  • Store results in JSONL format for seamless integration

By combining Deno, Scout SDK, and GPT-4o, you can automate web scraping for LLM applications, data pipelines, and real-time analytics.


Why Use LLMs for Web Scraping?

Traditional scraping methods rely on CSS selectors and regex to extract content. But websites are increasingly JavaScript-heavy, dynamic, and inconsistent.

By integrating GPT-4o, you can:

  • Extract structured JSON without custom parsing rules
  • Summarize and process unstructured content dynamically
  • Adapt to different website structures
  • Scale up web scraping with batch processing

Scout SDK and GPT-4o make web scraping easier, smarter, and more scalable.

How This Web Scraping Workflow Works

This workflow will:

  1. Scrape a website’s sitemap to retrieve all URLs
  2. Send pages in batches to the Scout SDK API
  3. Use LLMs to extract and structure content into JSON
  4. Save results in JSONL format for further analysis

Step 1: Install Dependencies

Make sure you have Deno installed. If not, install it from the command line.

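On macOS and Linux, Deno's official install script handles setup (see the Deno docs for Windows):

```sh
# Install Deno via the official install script
curl -fsSL https://deno.land/install.sh | sh
```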

Deno natively supports TypeScript and ES modules, so you can import modules directly:

```typescript
import { ScoutClient } from "npm:scoutos";
import Sitemapper from "npm:sitemapper";
```

Required libraries:

  • Scout SDK – API for executing LLM-powered workflows
  • Sitemapper – Fetches a website’s sitemap for crawling

Step 2: Fetch the Sitemap

To efficiently scrape a site, start by fetching its sitemap.xml:

```typescript
// main.ts
async function fetchSites(sitemapUrl: string): Promise<string[]> {
  console.log(`Fetching sites from ${sitemapUrl}`);
  const sitemap = new Sitemapper();
  const { sites } = await sitemap.fetch(sitemapUrl);
  console.log(`Fetched ${sites.length} sites`);
  return sites;
}
```

Why use a sitemap?
A sitemap.xml file already lists every URL a site wants indexed, so you skip link discovery entirely, making this approach faster and more accurate than crawling page by page.
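As a quick sanity check, you can call fetchSites directly (this snippet uses the Scout docs sitemap that the full script scrapes later):

```typescript
// Fetch the sitemap and preview the first few URLs
const urls = await fetchSites("https://docs.scoutos.com/sitemap.xml");
console.log(urls.slice(0, 5));
```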

Step 3: Process URLs in Batches

Instead of processing pages one by one, we batch them in groups of 20 for efficiency.
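The batching itself is just array slicing; here is a small standalone sketch (a hypothetical chunk helper, not part of the tutorial's script) that shows the idea:

```typescript
// Hypothetical helper: split a URL list into fixed-size batches
function chunk<T>(items: T[], size: number): T[][] {
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// chunk(sites, 20) yields arrays of at most 20 URLs each
```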

```typescript
// main.ts
async function processBatch(
  client: ScoutClient,
  batch: string[],
): Promise<void> {
  console.log(`Processing batch of ${batch.length} sites`);

  // Kick off all workflow runs in the batch concurrently
  const promises = batch.map(async (url) => {
    console.log(`Processing URL: ${url}`);
    const res = await client.workflows.run("wf_cm6k51et900020ds697g79atr", {
      inputs: { url: url },
    });

    // The workflow's JSON output lives in the run state; tag it with its URL
    const json = res.run.state.json.output;
    json.url = url;

    // Append one JSON object per line (JSONL)
    await Deno.writeTextFile("pages.jsonl", JSON.stringify(json) + "\n", {
      append: true,
    });
    console.log(`Processed and saved URL: ${url}`);
  });

  await Promise.all(promises);
  console.log(`Finished processing batch`);
}
```

Each URL is sent to a Scout workflow, which extracts structured JSON using GPT-4o. The processed output is appended to pages.jsonl, one JSON object per line, for easy storage.
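Because each line of pages.jsonl is an independent JSON object, loading the results back is straightforward (a small sketch, not part of the tutorial's script):

```typescript
// Read pages.jsonl back: one JSON object per non-empty line
const pages = (await Deno.readTextFile("pages.jsonl"))
  .split("\n")
  .filter((line) => line.trim().length > 0)
  .map((line) => JSON.parse(line));

console.log(`Loaded ${pages.length} pages`);
```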

Step 4: Run the Web Scraper

Now, tie everything together with the main function:

```typescript
// main.ts
async function main() {
  const sites = await fetchSites("https://docs.scoutos.com/sitemap.xml");

  // Deno reads environment variables via Deno.env (requires --allow-env)
  const client = new ScoutClient({
    apiKey: Deno.env.get("api_key"),
  });

  // Process the sitemap 20 URLs at a time
  for (let i = 0; i < sites.length; i += 20) {
    const batch = sites.slice(i, i + 20);
    await processBatch(client, batch);
  }

  console.log("All batches processed");
}

if (import.meta.main) {
  await main();
}
```

This script will:

  • Scrape all pages from the sitemap
  • Run LLM-powered extraction for each page
  • Save structured JSON output in pages.jsonl
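Deno is secure by default, so grant the script exactly the permissions it uses when you run it (the api_key value below is a placeholder):

```sh
# api_key matches the env var name read in main.ts
export api_key="YOUR_SCOUT_API_KEY"

# --allow-net: fetch the sitemap and call the Scout API
# --allow-env: read api_key
# --allow-write: append to pages.jsonl
deno run --allow-net --allow-env --allow-write main.ts
```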

Watch the Full Tutorial on YouTube

Want a step-by-step demo of how to implement this?

Previous Video: Scout Workflow Setup

Why Use Deno + Scout SDK for Web Scraping?

Deno is Secure & Fast

  • No node_modules required – Uses modern ES modules
  • Secure by default – No file/network access unless explicitly allowed
  • Built-in TypeScript support

Scout SDK Makes LLM-Powered Scraping Easy

  • Runs structured extraction workflows with GPT-4o
  • No need for complex parsing – AI handles it
  • Scalable & API-driven

Final Thoughts

By combining Deno, Scout SDK, and GPT-4o, you can scrape websites smarter and extract structured data at scale.

What would you use LLM-powered web scraping for? Drop a comment below or on YouTube.

Resources Mentioned

Scout SDK
Deno CLI
Sitemapper (npm package)
Previous Video (Scout Workflow Setup)


Ready to get started?

Sign up for free or chat live with a Scout engineer.
