Web Scraping with LLMs: GPT-4o, Deno & Scout SDK
Learn how to automate web scraping with LLMs using Deno, Scout SDK, and GPT-4o. Scrape entire sitemaps, extract structured JSON, and process web content efficiently. Perfect for AI and data pipeline workflows!

If you're looking for a scalable, AI-powered way to extract structured data from websites, this guide will show you how to:
- Scrape a full website sitemap using Deno
- Process web pages in batches with Scout SDK
- Use GPT-4o to extract structured JSON from raw web content
- Store results in JSONL format for seamless integration
By combining Deno, Scout SDK, and GPT-4o, you can automate web scraping for LLM applications, data pipelines, and real-time analytics.
Why Use LLMs for Web Scraping?
Traditional scraping methods rely on CSS selectors and regex to extract content. But websites are increasingly JavaScript-heavy, dynamic, and inconsistent.
By integrating GPT-4o, you can:
- Extract structured JSON without custom parsing rules
- Summarize and process unstructured content dynamically
- Adapt to different website structures
- Scale up web scraping with batch processing
Scout SDK and GPT-4o make web scraping easier, smarter, and scalable.
How This Web Scraping Workflow Works
This workflow will:
- Scrape a website’s sitemap to retrieve all URLs
- Send pages in batches to the Scout SDK API
- Use LLMs to extract and structure content into JSON
- Save results in JSONL format for further analysis
Step 1: Install Dependencies
Make sure you have Deno installed. If not, install the Deno CLI:
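curl -fsSL https://deno.land/install.sh | sh
This is the official install script for macOS and Linux; see the Deno CLI link under Resources for other platforms.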
Deno natively supports TypeScript and ES modules, so you can import modules directly:
import { ScoutClient } from "npm:scoutos";
import Sitemapper from "npm:sitemapper";
Required Libraries:
- Scout SDK – API for executing LLM-powered workflows
- Sitemapper – Fetches a website's sitemap for crawling
Because the imports use npm: specifiers, Deno downloads and caches both packages on first run; no separate install step is needed.
Step 2: Fetch the Sitemap
To efficiently scrape a site, start by fetching its sitemap.xml:
async function fetchSites(sitemapUrl: string): Promise<string[]> {
  console.log(`Fetching sites from ${sitemapUrl}`);
  const sitemap = new Sitemapper();
  const { sites } = await sitemap.fetch(sitemapUrl);
  console.log(`Fetched ${sites.length} sites`);
  return sites;
}
Why use a sitemap?
A sitemap.xml file lists every URL a site wants indexed, so you get the full set of pages up front rather than discovering them link by link with a crawler.
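Under the hood, a sitemap is just XML where each <loc> element holds one page URL. For illustration only, here is a minimal sketch of fetching those URLs with Deno's built-in fetch instead of Sitemapper (it assumes a flat sitemap.xml and ignores nested sitemap indexes, which Sitemapper handles for you):
// Hypothetical minimal alternative to Sitemapper, for illustration only.
// Assumes a flat sitemap.xml; does not follow nested sitemap indexes.
async function fetchSitemapUrls(sitemapUrl: string): Promise<string[]> {
  const res = await fetch(sitemapUrl);
  const xml = await res.text();
  // Every <loc>...</loc> element contains one page URL.
  return [...xml.matchAll(/<loc>(.*?)<\/loc>/g)].map((m) => m[1].trim());
}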
Step 3: Process URLs in Batches
Instead of processing pages one by one, we batch them in groups of 20 for efficiency.
async function processBatch(client: ScoutClient, batch: string[]): Promise<void> {
  console.log(`Processing batch of ${batch.length} sites`);
  const promises = batch.map(async (url) => {
    console.log(`Processing URL: ${url}`);
    // Run the Scout workflow that fetches the page and extracts JSON with GPT-4o.
    const res = await client.workflows.run("wf_cm6k51et900020ds697g79atr", {
      inputs: { url: url },
    });
    // Tag the extracted record with its source URL, then append it as one
    // JSON document per line (JSONL).
    const json = res.run.state.json.output;
    json.url = url;
    await Deno.writeTextFile("pages.jsonl", JSON.stringify(json) + "\n", {
      append: true,
    });
    console.log(`Processed and saved URL: ${url}`);
  });
  await Promise.all(promises);
  console.log(`Finished processing batch`);
}
Each URL is sent to Scout SDK, which extracts structured JSON using GPT-4o.
The processed JSON output is saved in JSONL format for easy storage.
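Because JSONL stores one JSON document per line, downstream tools can read the file back line by line without parsing one giant array. A minimal sketch of loading the results (the fields inside each record depend on your Scout workflow's output schema; only url is guaranteed by this script, and reading the file requires --allow-read):
// Load pages.jsonl back into an array of records, one JSON object per line.
const text = await Deno.readTextFile("pages.jsonl");
const records = text
  .split("\n")
  .filter((line) => line.trim().length > 0)
  .map((line) => JSON.parse(line));
console.log(`Loaded ${records.length} records`);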
Step 4: Run the Web Scraper
Now, tie everything together with the main function:
async function main() {
  const sites = await fetchSites("https://docs.scoutos.com/sitemap.xml");
  // Read the Scout API key from the environment (requires --allow-env).
  const client = new ScoutClient({
    apiKey: Deno.env.get("api_key"),
  });
  // Process the URLs in batches of 20.
  for (let i = 0; i < sites.length; i += 20) {
    const batch = sites.slice(i, i + 20);
    await processBatch(client, batch);
  }
  console.log("All batches processed");
}
if (import.meta.main) {
  await main();
}
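To run the script, pass the Scout API key via the api_key environment variable and grant network, environment, and file-write access (main.ts is an assumed filename here):
api_key="your-scout-api-key" deno run --allow-net --allow-env --allow-write main.ts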
This script will:
- Scrape all pages from the sitemap
- Run LLM-powered extraction for each page
- Save structured JSON output in pages.jsonl
Watch the Full Tutorial on YouTube
Want a step-by-step demo of how to implement this?
Previous Video: Scout Workflow Setup
Why Use Deno + Scout SDK for Web Scraping?
Deno is Secure & Fast
- No node_modules required – uses modern ES modules
- Secure by default – no file/network access unless explicitly allowed
- Built-in TypeScript support
Scout SDK Makes LLM-Powered Scraping Easy
- Runs structured extraction workflows with GPT-4o
- No need for complex parsing – AI handles it
- Scalable & API-driven
Final Thoughts
By combining Deno, Scout SDK, and GPT-4o, you can scrape websites smarter and extract structured data at scale.
What would you use LLM-powered web scraping for? Drop a comment below or on YouTube.
Resources Mentioned
- Scout SDK
- Deno CLI
- Sitemapper (npm package)
- Previous Video (Scout Workflow Setup)