If Edit Agent is the brain 🧠, Crawlers are the textbooks 📚. This is where you send your bot to read websites so it can learn about your products, policies or news. Instead of copying and pasting text manually, you simply tell it: “Go to this website, read everything and remember it.”

🎯 What is this for?

  • Keep it up to date: If you change a price on your website, the crawler will detect it on its next run.
  • Huge knowledge base: Ideal if you have hundreds of help articles or product pages.
  • Verification: Allows the agent to cite real sources (“According to our website…”).

🛠️ Configuring a Crawler (Step by Step)

When you click + Add Web Crawler, you’ll see the configuration screen. Think of it as the mission map for your crawler.

1. Name for the crawler

Give it a name that clearly identifies the source.
  • Bad: “Test 1”.
  • Good: Official_Support_Help or Blog_Updates_2024.

2. Update Frequency (The Rhythm)

How often should it reread the website? Use the slider.
  • 24 hours: The standard option. Checks for changes once a day.
  • More frequent: Use only for breaking news (it consumes more resources).

3. Crawl sources (The Strategy)

Here you decide how the bot enters your house:
  • 🌐 Website: The bot starts from the homepage and follows links one by one (like a curious human). Good for discovering content.
  • 🗺️ Sitemaps (New & Improved!): You give it an exact map (sitemap.xml). The best part? You are no longer limited to just one. You can click + Add another sitemap to feed the bot multiple maps at once. You can also use the Check Sitemaps button to verify they are working properly before launching. Faster, cleaner, and much more efficient.
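If you want to preview what a sitemap will feed the crawler before adding it, the sketch below lists the URLs a sitemap declares. It is only an illustration, not how the Check Sitemaps button works internally: the function name `list_sitemap_urls` and the sample sitemap content are hypothetical, and the XML is inlined so the example runs offline.

```python
import xml.etree.ElementTree as ET

# XML namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def list_sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> URL found in a sitemap (or sitemap index)."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall(".//sm:loc", SITEMAP_NS)
        if loc.text
    ]


# Hypothetical sitemap content, inlined instead of downloaded.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/products/widget</loc></url>
  <url><loc>https://example.com/blog/launch</loc></url>
</urlset>"""

urls = list_sitemap_urls(SAMPLE_SITEMAP)
print(urls)
```

An empty list, or a parse error, is a hint that the file is not a valid sitemap and is worth fixing before you launch the crawler.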

4. Crawl Options (The Scope)

  • Crawl Everything: It will read everything it finds.
  • Sub-paths: You can restrict it to /blog or /products so it doesn’t waste time on the “About us” page.
⚠️ Important: Make sure your website does not block bots (check your robots.txt). If you shut the door, it won’t be able to learn!
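To check a robots.txt file yourself, one option is Python’s standard `urllib.robotparser` module. The sketch below is a minimal example under stated assumptions: the user agent name `ExampleCrawler` is a placeholder (this guide does not state the crawler’s real user agent), and the robots.txt rules are a made-up sample inlined so the code runs without network access.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical robots.txt: every bot is blocked from /cart and /login only.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /cart
Disallow: /login
"""

# "ExampleCrawler" is a placeholder name, not the product's real user agent.
blog_ok = is_allowed(SAMPLE_ROBOTS, "ExampleCrawler", "https://example.com/blog/post")
cart_ok = is_allowed(SAMPLE_ROBOTS, "ExampleCrawler", "https://example.com/cart")
print(blog_ok)  # True
print(cart_ok)  # False
```

If a page you want the bot to learn comes back as `False`, the door is shut for that path and the rule in robots.txt needs relaxing.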

📄 Managing your Knowledge (The “Pages” Tab)

Once your crawler is set up, head over to the Pages tab. This is your control centre for the specific URLs the bot is reading. We’ve added some powerful new tools here:

➕ Add Page (Surgical Precision)

Sometimes you don’t need to crawl a whole website or an entire sitemap. If you have just published a single new blog post or an external article you want the bot to learn right now, simply click + Add Page. This allows you to manually inject specific URLs straight into the bot’s brain.

🔄 Rescrape Downloads (The Second Chance)

Did a website connection glitch out? Or maybe you updated the text on your website and want the bot to learn it immediately without waiting for the next scheduled cycle? Click the Rescrape Downloads button. This tells the system: “Take all the documents we’ve already downloaded and try to extract their information again.” It’s the perfect refresh button.

📊 Status Traffic Lights (What’s happening?)

In the Pages list, you’ll see the exact status of every single URL. Here’s what they mean:
  • 🟢 Scraped / Indexed: Success! ✅ The content has been read, processed and is now safely stored in the agent’s brain.
  • 🟠 Downloaded: The page has been downloaded but not yet processed (it’s still “digesting” the information).
  • 🔴 Error: Something went wrong. The website might be down, require a login, or have an anti-bot firewall blocking the way.

🎓 Best Practices Summary (Cheat Sheet)

To keep a clean and useful knowledge library:
  • Multiple Sitemaps are your best friend: Instead of crawling a massive website, provide specific sitemaps (e.g., sitemap-products.xml and sitemap-blog.xml). It keeps the bot focused.
  • Avoid junk pages: You don’t need to index “Shopping Cart”, “Login” or “Legal Notice”.
  • Clear naming: When you have 10 crawlers, you’ll be glad you named them FAQ_ES and FAQ_EN instead of web1 and web2.
  • Surgical additions: Use the + Add Page button for quick updates instead of forcing a full crawl of your site.

🆘 Quick Troubleshooting

| Problem | Likely Fix 🔧 |
| --- | --- |
| Status says “Error” 🔴 | Your website might be blocking bots. Check your firewall settings. If it was a temporary glitch, try hitting Rescrape Downloads. |
| Reads too many pages | Switch to the Sitemap option or restrict the Sub-paths so it only reads what matters. |
| Information not updating | Check the “Update Frequency” slider. It might be set to “Monthly” when you need “Daily”. |
| The agent mixes data | Do you have two crawlers reading the same content? Remove duplicates. |
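To spot two crawlers reading the same content, you can compare their URL lists. The sketch below assumes you have copied the URLs from each crawler’s Pages tab by hand. The helper names and sample URLs are hypothetical, and the normalization rule (ignoring host case, trailing slashes, query strings and fragments) is an assumption that may be too aggressive if your site uses query strings to serve different pages.

```python
from urllib.parse import urlsplit


def normalize(url: str) -> str:
    """Lowercase scheme and host; drop trailing slash, query string and fragment."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    return f"{parts.scheme.lower()}://{parts.netloc.lower()}{path}"


def overlapping_urls(crawler_a: list[str], crawler_b: list[str]) -> list[str]:
    """Return the normalized URLs that appear in both crawlers, sorted."""
    a = {normalize(u) for u in crawler_a}
    b = {normalize(u) for u in crawler_b}
    return sorted(a & b)


# Hypothetical URL lists for two crawlers.
faq_en = ["https://example.com/faq/", "https://example.com/faq/shipping"]
blog = ["https://Example.com/faq", "https://example.com/blog/launch"]

duplicates = overlapping_urls(faq_en, blog)
print(duplicates)  # ['https://example.com/faq']
```

Any URL in the result is being read twice, so it can be dropped from one of the two crawlers.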
All set! With this in place, your agent will stop improvising and start answering with real, up-to-date data from your websites. 🕵️‍♂️📚