The Beginner's Guide to Scraping Websites for LLM Data Enrichment
Tackle Web Scraping Like a Pro: Your Go-To Guide for Everything Scraping
Getting large language models (LLMs) to perform at their best requires the right combination of data, content, and context. Sourcing this data, however, isn’t always straightforward. So where does it come from? Whether it’s your docs or website, or a search of Google or Reddit, the answer is usually somewhere on the internet.
At first glance, web scraping might seem like a straightforward way to gather data, but the reality is much more complex. Many websites implement sophisticated defenses to block automated data collection, from CAPTCHA challenges to IP blocking and dynamic content loading.
In fact, studies suggest that up to 50% of website traffic can be bots, prompting companies to invest heavily in anti-scraping technologies (Wired). Even if you clear the technical hurdles, extracting clean, structured, and meaningful data can still feel like searching for a needle in a haystack.
Scout offers one of the most powerful and customizable web scrapers on the market, enabling users to efficiently extract content from websites, even those pesky .gov sites. Users can customize the scraping process by selecting the scraper, defining text extraction methods, specifying CSS selectors to include or exclude certain HTML elements, and fine-tuning numerous other configuration options to ensure the perfect scrape.
The best scraper in the world won’t help if your data is messy. Scout lets you clean and organize your data as it’s extracted, saving you hours of post-scrape cleanup. Plus, built-in error handling helps manage tricky sites that throw unexpected twists your way.
Grab some snacks, pour yourself a drink, and settle in: this is going to be a long one. Welcome to the definitive guide to everything you’ve ever wanted to know about web scraping!
What is Web Scraping?
Before diving into the technical details, let’s quickly define web scraping. At its core, web scraping is the process of automatically extracting data from websites. This data can be anything from text and images to tables and links—essentially, any publicly available information on the web. By scraping websites, you can collect large volumes of data quickly and efficiently, which can then be used to train large language models (LLMs), enhance search capabilities, improve chatbots, or even track competitor activities.
Client Rendered vs. Server Rendered Websites
Not all websites are created equal. Websites can generally be categorized into two types: client-rendered (JavaScript-heavy) or server-rendered (static HTML). Depending on how a website is built, you’ll need different scraping strategies.
Here are a few tricks to help determine whether you’re looking at a client-rendered site or a server-rendered site:
1. Check the Page Source:
- Server-rendered (Static HTML): Right-click on the page and select View Page Source. (Inspect is less reliable for this check, because it shows the page after JavaScript has run.) If the content you’re looking for (such as text or images) is visible in the raw HTML without interacting with the page, it is likely server-rendered.
- Client-rendered (JavaScript-heavy): If the page source contains minimal HTML and the content is missing, it’s likely because JavaScript is loading it dynamically. In this case, you’ll need a scraper like Playwright that can execute the JavaScript to render the full content.
2. Inspect Network Activity (Using Developer Tools):
- Open the browser’s Developer Tools (right-click the page and select Inspect, then go to the Network tab).
- Server-rendered sites: The content will load directly with the initial HTTP request. Look for network requests that return HTML documents containing the full page content.
- Client-rendered sites: After the initial HTML is loaded, you’ll see additional requests for JavaScript files, API calls, and AJAX requests that load content dynamically. Look for XHR (XMLHttpRequest) or Fetch requests that fetch JSON or HTML data after the page loads.
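The page-source check above can be approximated in code. This is a minimal sketch, assuming you already have the raw HTML of the initial response and a phrase you can see in the rendered page; the function name and the text normalization are illustrative choices, not part of Scout.

```python
import html
import re


def appears_in_raw_html(raw_html: str, visible_phrase: str) -> bool:
    """Return True if a phrase seen in the browser is already in the raw HTML.

    True suggests the page is server-rendered; False suggests the content is
    injected later by JavaScript. This is a heuristic, not a guarantee.
    """
    # Drop script/style bodies so text inside JavaScript does not count as content.
    without_scripts = re.sub(
        r"<(script|style)\b.*?</\1\s*>", " ", raw_html, flags=re.DOTALL | re.IGNORECASE
    )
    # Remove remaining tags, decode entities, and collapse whitespace.
    text = html.unescape(re.sub(r"<[^>]+>", " ", without_scripts))
    normalized_text = " ".join(text.split()).lower()
    normalized_phrase = " ".join(visible_phrase.split()).lower()
    return normalized_phrase in normalized_text


# Hypothetical pages: one ships its content in HTML, the other only an empty root.
server_page = "<html><body><h1>Pricing</h1><p>Plans start at $10.</p></body></html>"
client_page = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

print(appears_in_raw_html(server_page, "Plans start at $10."))  # True
print(appears_in_raw_html(client_page, "Plans start at $10."))  # False
```

If the phrase is missing from the raw HTML, that is a hint to reach for a browser-based scraper rather than a plain HTTP request.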
Scout’s Scrapers
Now that we know a bit about the different sites we may encounter, let’s look at the different scrapers and text extractors Scout offers.
- HTTP Scraper:
- Fast and efficient for static websites with content in the initial HTML. Best for simple, server-rendered sites. Use this when you don’t need a full browser experience. Less overhead means quicker scrapes and cleaner data.
- Playwright Scraper:
- This one’s your go-to for the tough stuff. It mimics a real user browsing the site, executing JavaScript to reveal content that’s not visible on page load. Slower and more resource-heavy, but necessary for dynamic sites that use JavaScript to load content. Ideal for client-rendered sites.
Text Extractors
When setting up your scraping job, one of the most important decisions you'll make is choosing the right text extractor. The text extractor is the tool that helps you pull out the relevant content from a webpage, and selecting the appropriate method can make all the difference. Scout offers three primary text extractors: Custom, Readability, and Trafilatura. Each method has its strengths and is best suited to different types of scraping needs.
Custom Text Extractor:
The Custom text extractor is the most flexible option, offering complete control over how you scrape data from a website. With this method, you can define your own scraping logic based on your needs. This means you can select specific HTML elements, use CSS selectors, or even apply regular expressions to capture exactly the data you want from a page.
While the Custom extractor offers unparalleled flexibility, it requires some technical expertise to find the correct classes or IDs to include or exclude.
The biggest advantage of using the Custom extractor is that it’s ideal for more complex scraping scenarios. If you're scraping a website with an unusual structure, or if you need to extract very specific elements of a page that other extractors can’t handle, the Custom extractor gives you the freedom to get that perfect scrape.
Readability Text Extractor:
For users who want an easier and more automated scraping process, the Readability text extractor is a great choice. This option is built around Scout’s smart scraping logic, which automatically identifies the most relevant components of a webpage. Essentially, the Readability extractor works like an editor, filtering out unnecessary content like ads, sidebars, or navigation menus, and focusing on the main body of text, such as articles, blog posts, and similar content. This makes it ideal for scraping websites that primarily contain text-based content.
The Readability extractor excels when you want to quickly extract clean and relevant content without diving into technical customization. It’s an ideal choice for general web scraping tasks, especially when you want to capture readable text such as news articles or blog posts. Since it automatically filters out irrelevant content, it ensures that only the most important text is included in the scrape, saving you time and effort.
However, the Readability extractor does have its limitations. While it does a great job at extracting readable content, it might not capture everything you need from a webpage, especially if the page has complex formatting or contains other types of content like images, videos, or interactive elements.
Trafilatura Text Extractor:
If you’re looking for a more powerful text extraction tool, particularly for long-form content or research-oriented websites, Trafilatura is a great option. Trafilatura is a Python package and command-line tool designed to extract high-quality, structured text from webpages. It’s particularly useful for scraping complex, content-rich websites, such as academic papers, research reports, or lengthy blog posts. The main strength of Trafilatura lies in its ability to handle large amounts of text while filtering out extraneous elements like ads, comments, and navigation links, leaving only the main body of text.
Trafilatura Type:
Trafilatura provides several settings to fine-tune your scrapes, including options that trade precision against recall depending on your needs:
- Precision: Focuses on extracting only the most central and relevant content, reducing noise in the results. Best for well-structured pages where accuracy is key.
- Recall: Expands the scope of extraction to capture more elements, helpful if key parts of your documents are missing. Ideal for scraping less structured or fragmented content.
- Default: Outputs plain text (TXT) without additional metadata, offering clean, straightforward extraction.
- Baseline: Uses a faster extraction method targeting text paragraphs and JSON metadata for efficient data handling.
- Html2txt: Maximizes recall by extracting all available text from a document, capturing as much content as possible.
For a full breakdown of Trafilatura’s features and customization tools, check out their documentation.
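To make these options concrete, here is a rough sketch of how each one might map onto the Trafilatura Python package. The mapping itself and the `extract_text` helper are assumptions for illustration; how Scout wires these options internally is not documented here. The calls used (`extract` with `favor_precision` / `favor_recall`, `baseline`, and `html2txt`) are part of Trafilatura's public API.

```python
from typing import Optional

MODES = ("default", "precision", "recall", "baseline", "html2txt")


def extract_text(html: str, mode: str = "default") -> Optional[str]:
    """Extract text from an HTML string using the chosen Trafilatura mode."""
    if mode not in MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {MODES}")

    # Imported lazily so the mode check above works without the package installed.
    import trafilatura  # third-party: pip install trafilatura

    if mode == "precision":
        # Prefer less text but more confidence that it is main content.
        return trafilatura.extract(html, favor_precision=True)
    if mode == "recall":
        # Prefer more text when parts of the document go missing.
        return trafilatura.extract(html, favor_recall=True)
    if mode == "baseline":
        # baseline() returns (lxml body element, text, text length).
        _, text, _ = trafilatura.baseline(html)
        return text
    if mode == "html2txt":
        # Grab all text in the document, maximizing recall.
        return trafilatura.html2txt(html)
    return trafilatura.extract(html)  # "default": plain-text output
```

A reasonable workflow is to start with the default, switch to recall if content is missing, and switch to precision if too much noise comes through.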
Sitemap vs Crawl
When scraping a website, one of the first decisions you’ll need to make is whether to use a crawl or a sitemap to guide the scraping process. Both methods are valid, but they serve different purposes and are best suited to specific scenarios. Here are some key considerations.
A crawl involves systematically exploring a website by following its links to discover and scrape content. This approach is like a digital spider, starting at a single page (often the homepage) and branching out to other linked pages. Crawling is particularly useful when a website doesn’t provide a well-structured sitemap or when you want to scrape content from parts of the site not explicitly listed in the sitemap.
Crawling, however, can be time-consuming, especially on large websites with deep link structures. Without careful configuration, a crawl may encounter duplicate pages or unnecessary content like navigation links.
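The crawl described above is essentially a breadth-first walk over a site's links. The sketch below is a simplified illustration, not Scout's crawler: `get_links` is a hypothetical callback standing in for fetching a page and extracting its links, and the in-memory `FAKE_SITE` replaces real HTTP requests so the example runs anywhere.

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(start_url: str, get_links: Callable[[str], Iterable[str]], max_depth: int) -> List[str]:
    """Breadth-first crawl that visits each URL once and stops at max_depth."""
    visited = {start_url}
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue  # do not follow links beyond the depth limit
        for link in get_links(url):
            if link not in visited:  # skip pages that are already queued or visited
                visited.add(link)
                queue.append((link, depth + 1))
    return order


# A tiny in-memory "site" standing in for real fetching and link extraction.
FAKE_SITE = {
    "/": ["/docs", "/blog", "/"],
    "/docs": ["/docs/quickstart", "/"],
    "/blog": ["/blog/post-1", "/docs"],
    "/docs/quickstart": [],
    "/blog/post-1": [],
}

pages = crawl("/", lambda url: FAKE_SITE.get(url, []), max_depth=1)
print(pages)  # ['/', '/docs', '/blog']
```

The `visited` set is what prevents duplicate pages, and the depth check is what keeps a crawl from wandering indefinitely through deep link structures.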
A sitemap is a file provided by a website that lists its important pages, often in a structured format like XML. Sitemaps are essentially a roadmap of a website, designed to help search engines and scrapers find content quickly and efficiently. If a website has a sitemap, it’s often the fastest and easiest way to guide your scrape. If a sitemap is poorly organized or overly simplistic, however, it might not be as helpful. To check whether a website has a sitemap, append /sitemap.xml to the end of the URL. Another option is to plug the URL of the website you want to scrape into a tool like https://seomator.com/sitemap-finder. For example, https://docs.scoutos.com/sitemap.xml leads you to the sitemap for the Scout docs site.
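Once you have a sitemap in hand, reading it is straightforward. Here is a minimal sketch using only Python's standard library; the sample XML, the function names, and the choice to keep pages that have no `lastmod` value are assumptions for illustration. It handles a plain `<urlset>` sitemap, not a sitemap index.

```python
import xml.etree.ElementTree as ET
from datetime import date
from typing import List, Optional, Tuple

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text: str) -> List[Tuple[str, Optional[date]]]:
    """Return (url, last_modified) pairs from a sitemap <urlset> document."""
    root = ET.fromstring(xml_text)
    entries = []
    for url_node in root.findall("sm:url", SITEMAP_NS):
        loc = url_node.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = url_node.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS)
        # lastmod is optional; keep only the YYYY-MM-DD part when present.
        modified = date.fromisoformat(lastmod.strip()[:10]) if lastmod else None
        if loc:
            entries.append((loc, modified))
    return entries


def modified_on_or_after(entries, cutoff: date):
    """Drop entries modified before cutoff; entries without lastmod are kept."""
    return [(url, mod) for url, mod in entries if mod is None or mod >= cutoff]


SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc><lastmod>2024-05-01</lastmod></url>
  <url><loc>https://example.com/docs/old</loc><lastmod>2021-01-15T08:00:00+00:00</lastmod></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""

entries = parse_sitemap(SAMPLE)
recent = modified_on_or_after(entries, date(2023, 1, 1))
print([url for url, _ in recent])
# ['https://example.com/docs/intro', 'https://example.com/about']
```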
Choosing Between a Crawl and a Sitemap
- Use a crawl when:
- The site doesn’t have a sitemap or the sitemap is incomplete.
- You want to discover every possible page, even those not explicitly listed.
- The site has a deep or complex structure that a sitemap might not fully represent.
- You need comprehensive coverage, including dynamically linked or hidden pages.
- Use a sitemap when:
- The site provides a clear and well-maintained sitemap.
- You want to target specific, high-priority pages or sections listed in the sitemap.
- Efficiency is critical, and you want to reduce the load on the website’s servers.
- You’re scraping a small or straightforward site where the sitemap includes most if not all of the relevant content.
Scout’s Advanced Scrape Options
Scout provides advanced configuration options to help you refine your scrapes and remove unwanted information.
- Exclude Pages with a last Mod. Date Prior to:
- Avoid scraping outdated content by setting a threshold for the "last modified" date. This option works with sitemaps that display the last modification date for their pages.
- Max Depth:
- This setting lets you define how deep your scraper will explore a website, limiting the number of link levels it follows from the starting page. Depth is measured by the number of "clicks" it takes to reach a page from the homepage: the deeper the page, the more clicks (or levels) are required to access it. An easy way to gauge depth is by counting the forward slashes in the URL path after the domain. For instance, a depth of 4 might look something like this:
https://docs.scoutos.com/docs/workflows/blocks/continue-if
This feature is particularly helpful when you want to focus on top-level content or avoid getting lost in the weeds of deeply nested pages.
- Strip:
- Manages text cleanup during HTML-to-Markdown conversion to ensure consistent formatting.
- When strip=True (the default setting), leading and trailing whitespace is removed, multiple blank lines are collapsed into single ones, and the output becomes cleaner and more standardized. This is ideal for creating neat and consistent text.
- On the other hand, setting strip=False preserves all whitespace and newlines, making it useful for cases where maintaining precise formatting is essential. This parameter helps manage diverse HTML formatting styles when storing web content.
- Strip URLs:
- Stripping a URL removes unnecessary parts, such as query parameters, fragments, or trailing slashes, to simplify or standardize it.
- Why Strip URLs?
- Avoiding Duplicates: Many websites have multiple URLs that point to the same content. For example:
- example.com/page?ref=facebook
- example.com/page
- Stripping the query parameter (?ref=facebook) ensures these URLs are treated as the same page, preventing duplicate scraping.
- Allow:
- A list of allowed paths or patterns that the scraper will include in the crawl.
- For example, if you wish to scrape only a website’s blog or news section, you would set Allow to /blog, /news
- Deny:
- Similar to the above Allow feature, this is a list of paths or patterns that the scraper will exclude from the crawl.
- Allowed domains:
- Specify which domains are allowed in the crawl. Links to other domains will be excluded unless specified here.
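Several of the URL-related options above (Strip URLs, Max Depth, Allow, Deny, and Allowed domains) can be pictured as small checks applied to each URL before it is scraped. The sketch below is an illustration under stated assumptions, not Scout's implementation: the function names are hypothetical, Allow and Deny are treated as simple path prefixes, and depth is counted as the number of path segments.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_url(url: str) -> str:
    """Remove the query string, fragment, and trailing slash from a URL."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))


def url_depth(url: str) -> int:
    """Count non-empty path segments, e.g. /docs/workflows -> 2."""
    return len([segment for segment in urlsplit(url).path.split("/") if segment])


def should_scrape(url, allow=(), deny=(), allowed_domains=(), max_depth=None) -> bool:
    """Apply domain, depth, deny, and allow checks to a single URL."""
    parts = urlsplit(url)
    if allowed_domains and parts.netloc not in allowed_domains:
        return False
    if max_depth is not None and url_depth(url) > max_depth:
        return False
    if any(parts.path.startswith(prefix) for prefix in deny):
        return False
    # An empty allow list means "allow everything that was not denied".
    return not allow or any(parts.path.startswith(prefix) for prefix in allow)


print(strip_url("https://example.com/page?ref=facebook"))  # https://example.com/page
print(url_depth("https://docs.scoutos.com/docs/workflows/blocks/continue-if"))  # 4
print(should_scrape("https://example.com/careers", allow=("/blog", "/news")))  # False
```

Stripping URLs before checking whether a page has already been seen is what lets example.com/page?ref=facebook and example.com/page count as the same page.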
Monitoring Web Scrapes:
You can monitor the progress of your web scrapes in real-time from the dashboard via the jobs page. Once you click the “Run” button to start a web scrape, you’ll land on a page showing you the progress and status of each job.
While on the jobs page, you can inspect each page as it comes in by clicking on it. This opens a side panel showing the content that has been retrieved, and it is the first place to check to ensure a good, clean scrape. If you notice that a missing exclusion is causing noise in the scraped content and you don’t want to configure a new job from scratch, click “View Full Configuration” at the top of the jobs page to open a side panel with the configuration used for the previous job. This JSON is editable, so you can make quick changes to the job’s config and rerun the job with the new parameters. This is the best place to iterate on dialing in your scrapes.
Examples of good and bad scrapes:
Not sure what we mean by noise and unwanted content in your scrapes? Let’s look at a few scrapes to see what each config returns and how to clean up unwanted information that can affect recall.
Below is a scrape of the Scout docs site using HTTP/Custom without any exclusions. While we grabbed the main content of the page, we also pulled in the sidebar information.
Depending on your use case, this added information can be detrimental to answer recall and can possibly lead to hallucinations. For example, we’re looking at the Build a RAG app in under 5 minutes page. While this page contains all the Quickstart information, the added sidebar content isn’t ideal.
Let’s imagine you’re querying a chatbot that has this information as part of its internal knowledge base.
If you send in a vague query regarding a “slack bot,” which is one of the links in the sidebar menu, we may very well pull in this unrelated getting-started page. To clean up a scrape like this, let’s inspect the page and find the class for the sidebar. In this case, it is the class “.fern-sidebar-group”.
Now if we rerun this job with the class “.fern-sidebar-group” added to the exclusions, we should see a much cleaner scrape. Removing noise from your scraped data is key to reducing hallucinations in your output. Remember: when excluding a class, add a period “.” before the class name; when excluding an ID, add a hash “#” before the ID.
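To see what an exclusion does to the extracted text, here is a small standard-library sketch. It is not how Scout implements exclusions: the class and function names are hypothetical, only simple “.class” and “#id” selectors are supported, and it assumes reasonably well-formed HTML where non-void tags are closed.

```python
from html.parser import HTMLParser

# Tags that never have a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ExcludingTextExtractor(HTMLParser):
    """Collect page text while skipping elements that match excluded selectors."""

    def __init__(self, exclusions):
        super().__init__()
        self.excluded_classes = {s[1:] for s in exclusions if s.startswith(".")}
        self.excluded_ids = {s[1:] for s in exclusions if s.startswith("#")}
        self.skip_depth = 0  # greater than 0 while inside an excluded element
        self.chunks = []

    def _is_excluded(self, tag, attrs):
        attributes = dict(attrs)
        classes = set((attributes.get("class") or "").split())
        return (tag in ("script", "style")
                or bool(classes & self.excluded_classes)
                or attributes.get("id") in self.excluded_ids)

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self.skip_depth or self._is_excluded(tag, attrs):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.chunks.append(data.strip())


def extract_visible_text(page_html, exclusions=()):
    parser = ExcludingTextExtractor(exclusions)
    parser.feed(page_html)
    parser.close()
    return "\n".join(parser.chunks)


PAGE = """
<div class="fern-sidebar-group"><a href="/slack">Slack bot</a><a href="/api">API</a></div>
<main id="content"><h1>Quickstart</h1><p>Build a RAG app.</p></main>
<div id="footer">Contact us</div>
"""
print(extract_visible_text(PAGE, [".fern-sidebar-group", "#footer"]))
# Quickstart
# Build a RAG app.
```

Without the exclusions, the “Slack bot” link text ends up in the scraped content, which is exactly the kind of noise that can surface an unrelated page for a vague query.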
Now you may ask, “Why don’t I just include the one class that I know contains the body of the page and the majority of the information?” Unless you know the website backward and forward, there may be pages that don’t adhere to the same naming structure as the page you’re currently viewing.
Some .gov sites for example are notorious for adding contact info and added tidbits in an aside or part of the header.
For example:
City councilman bios on the denver.org website have contact info, as well as some other information, separate from the main body or main content of the page. So if we’re using an include selector “body”, the information in the aside would be missed. This is a perfect example of why NOT to use just a single selector for a website’s content unless you are familiar with the site and its layout.
Scrape Frequency, Scheduling a CRON
Inaccurate or outdated data can compromise the response quality, lead to irrelevant chatbot responses, or reduce the effectiveness of search functions. Scout’s scheduling feature allows you to automate scraping jobs at intervals that match your needs. Whether you need updates every hour, daily, weekly, or on a custom schedule, CRON-based frequency settings make it simple to set up recurring scrapes. This flexibility ensures that your knowledge base, customer support tools, and competitive analyses are always working with the most up-to-date information available.
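For reference, here are a few common schedules written as standard five-field cron expressions (minute, hour, day of month, month, day of week). Whether Scout accepts exactly this syntax is an assumption; the schedule names and the small validation helper are illustrative only.

```python
# Standard five-field cron: minute hour day-of-month month day-of-week.
SCHEDULES = {
    "hourly": "0 * * * *",          # at minute 0 of every hour
    "daily": "0 0 * * *",           # at 00:00 every day
    "weekly": "0 0 * * 0",          # at 00:00 every Sunday
    "weekdays_6am": "0 6 * * 1-5",  # at 06:00 Monday through Friday
}


def has_five_fields(expression: str) -> bool:
    """Cheap sanity check: a standard cron expression has exactly five fields."""
    return len(expression.split()) == 5


for name, expression in SCHEDULES.items():
    print(f"{name}: {expression} (valid shape: {has_five_fields(expression)})")
```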
Conclusion
Web scraping for data enrichment doesn’t need to be a challenge. With Scout’s powerful and easy-to-use tools, you can quickly transform raw web data into valuable insights—without the hassle. Whether you’re looking to enhance your LLMs, keep your knowledge base fresh, or stay ahead of the competition, Scout has you covered. Sign up for free today and start scraping.