Crawler Library

If the Agent is the brain 🧠, Crawlers are the textbooks 📚. This is where you send your bot to read websites so it can learn about your products, policies or news. Instead of copying and pasting text manually, you simply tell it: “Go to this website, read everything and remember it.”

🎯 What is this for?

  • Keep it up to date: If you change a price on your website, the crawler will detect it on its next run.
  • Huge knowledge base: Ideal if you have hundreds of help articles or product pages.
  • Verification: Allows the agent to cite real sources (“According to our website…”).

🛠️ Configuring a Crawler (Step by Step)

When you click + Add Web Crawler, you’ll see the Web Crawl Configuration screen. Think of it as the mission map for your crawler.

1. Name for the crawler

Give it a name that clearly identifies the source.
  • Bad: “Test 1”.
  • Good: Official_Support_Help or Blog_Updates_2024.

2. Update Frequency (The Rhythm)

How often should it reread the website? Use the slider.
  • 24 hours: The standard option. Checks for changes once a day.
  • More frequent: Use only for breaking news (it consumes more resources).

3. Crawl sources (The Strategy)

Here you decide how the bot enters your house:
  • 🌐 Website: The bot starts from the homepage and follows links one by one (like a curious human). Good for discovering content.
  • 🗺️ Sitemap: You give it an exact map (sitemap.xml). The bot goes straight to the listed URLs. Faster and more efficient.
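
Curious what the Sitemap option actually reads? Here’s a minimal sketch in Python of parsing a standard sitemap.xml (the URL is a hypothetical example; the platform does this for you, this is only to show why the route is faster):

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical URL for illustration; point this at your own sitemap.
SITEMAP_URL = "https://www.example.com/sitemap.xml"
# Standard namespace from the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(SITEMAP_URL) as resp:
    tree = ET.parse(resp)

# Each <url><loc> entry is an exact page address: no link-chasing needed.
urls = [loc.text for loc in tree.findall(".//sm:loc", NS)]
print(f"The sitemap lists {len(urls)} pages to crawl directly.")
```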

4. Crawl Options (The Scope)

  • Crawl Everything: It will read everything it finds.
  • Sub-paths: You can restrict it to /blog or /products so it doesn’t waste time on the “About us” page.
⚠️ Important: Make sure your website does not block bots (check your robots.txt). If you shut the door, it won’t be able to learn!
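
If you want to check the door yourself, Python’s standard library can read a robots.txt. A quick sketch (the site and page URLs are hypothetical examples):

```python
from urllib import robotparser

# Hypothetical site; point this at your own robots.txt.
rp = robotparser.RobotFileParser("https://www.example.com/robots.txt")
rp.read()

# "*" stands for any crawler; swap in your crawler's user agent if you know it.
page = "https://www.example.com/products/widget"
if rp.can_fetch("*", page):
    print("The door is open: crawlers may read this page.")
else:
    print("Blocked by robots.txt - the crawler will fail here.")
```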

📊 Status Traffic Lights (What’s happening?)

In the main list (first image), you’ll see small coloured dots. Here’s what they mean:
  • 🟡 Pending: In the queue. Waiting for its turn.
  • 🔵 Running: Working. The bot is crawling the website right now.
  • 🟢 Scraped: Success! ✅ The content has been read, processed and is now in the agent’s brain.
  • 🔴 Failed: Something went wrong. The website may be down, require a login or block crawlers.
  • 🟠 Downloaded: Downloaded but not yet processed (it’s still “digesting” the information).
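
As a mental model, here’s the same lifecycle as a minimal Python sketch (illustrative only, not the platform’s actual code). A healthy run moves top to bottom and ends in SCRAPED:

```python
from enum import Enum

class CrawlStatus(Enum):
    """Crawler lifecycle, in the order a healthy run moves through it."""
    PENDING = "🟡"     # queued, waiting for its turn
    RUNNING = "🔵"     # actively crawling the website
    DOWNLOADED = "🟠"  # pages fetched, content still being processed
    SCRAPED = "🟢"     # processed and stored in the agent's brain
    FAILED = "🔴"      # site down, login wall, or bots blocked

# A normal run ends here; anything stuck elsewhere deserves a look.
print(CrawlStatus.SCRAPED.value, CrawlStatus.SCRAPED.name)
```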

🎓 Best Practices Summary (Cheat Sheet)

To keep a clean and useful knowledge library:
  • Use Sitemaps: Whenever possible, use the Sitemap option. It’s much cleaner than letting the bot wander through random links.
  • Avoid junk pages: You don’t need to index “Shopping Cart”, “Login” or “Legal Notice”. Configure exclusions if you use advanced options (see the sketch after this list).
  • Clear naming: When you have 10 crawlers, you’ll be glad you named them FAQ_ES and FAQ_EN instead of web1 and web2.
  • Check the update rate: If your website rarely changes, don’t run the crawler every hour. You’ll only annoy the servers.
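
To make “configure exclusions” concrete, here’s a minimal sketch of prefix-based filtering. The paths and helper name are hypothetical, not the platform’s actual settings:

```python
# Hypothetical junk paths you would exclude from indexing.
EXCLUDE_PREFIXES = ("/cart", "/login", "/legal-notice")

def worth_indexing(path: str) -> bool:
    """Keep a page only if it isn't under an excluded prefix."""
    return not path.startswith(EXCLUDE_PREFIXES)

pages = ["/blog/new-feature", "/cart/checkout", "/products/widget", "/login"]
print([p for p in pages if worth_indexing(p)])
# -> ['/blog/new-feature', '/products/widget']
```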

🆘 Quick Troubleshooting

| Problem | Likely fix 🔧 |
| --- | --- |
| Status “Failed” 🔴 | Your website may have an anti-bot firewall. Check whether you need to whitelist the crawler. |
| Reads too many pages | Switch to Sitemap or restrict the path so it only reads what matters. |
| Information not updating | Check the “Update Frequency” slider. It might be set to “Monthly” when you need “Daily”. |
| The agent mixes data | Do you have two crawlers reading the same content? Remove duplicates. |
All set! With this in place, your agent will stop improvising and start answering with real data from your website. 🕵️‍♂️📚