
🎯 What is this for?
- Keep it up to date: If you change a price on your website, the crawler will detect it on its next run.
- Huge knowledge base: Ideal if you have hundreds of help articles or product pages.
- Verification: Allows the agent to cite real sources (“According to our website…”).
🛠️ Configuring a Crawler (Step by Step)
When you click + Add Web Crawler, you’ll see this screen. Think of it as the mission map for your crawler.
1. Name for the crawler
Give it a name that clearly identifies the source.
- Bad: “Test 1”.
- Good: `Official_Support_Help` or `Blog_Updates_2024`.
2. Update Frequency (The Rhythm)
How often should it reread the website? Use the slider.
- 24 hours: The standard option. Checks for changes once a day.
- More frequent: Use only for breaking news (it consumes more resources).
3. Crawl sources (The Strategy)
Here you decide how the bot enters your house:
- 🌐 Website: The bot starts from the homepage and follows links one by one (like a curious human). Good for discovering content.
- 🗺️ Sitemap: You give it an exact map (`sitemap.xml`). The bot goes straight to the listed URLs. Faster and more efficient (see the sketch after this list).
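Under the hood, a sitemap is just an XML file listing the exact URLs to visit. Here’s a minimal sketch of what sitemap-based crawling means, using only Python’s standard library (the URL is a placeholder, and this is an illustration, not the product’s actual crawler code):

```python
# Sketch of "Sitemap" mode: fetch sitemap.xml and collect the
# exact list of URLs, instead of discovering pages link by link.
import urllib.request
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://example.com/sitemap.xml"  # placeholder URL
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

with urllib.request.urlopen(SITEMAP_URL) as response:
    tree = ET.parse(response)

# Every <loc> entry is a page the crawler can fetch directly.
urls = [loc.text for loc in tree.findall(".//sm:loc", NS)]
print(f"Sitemap lists {len(urls)} URLs")
```

Because the URL list is explicit, there is no wandering: every request goes to a page you actually listed.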
4. Crawl Options (The Scope)
- Crawl Everything: It will read everything it finds.
- Sub-paths: You can restrict it to `/blog` or `/products` so it doesn’t waste time on the “About us” page (see the sketch after this list).
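Conceptually, a sub-path restriction is a simple prefix check on the URL path. The sketch below illustrates the idea in Python (the product’s exact matching rules may differ; `ALLOWED_PREFIXES` is a hypothetical name):

```python
# Illustration of a sub-path filter: only URLs whose path begins
# with an allowed prefix stay in scope. Not the product's code.
from urllib.parse import urlparse

ALLOWED_PREFIXES = ("/blog", "/products")  # hypothetical config

def in_scope(url: str) -> bool:
    return urlparse(url).path.startswith(ALLOWED_PREFIXES)

print(in_scope("https://example.com/blog/new-pricing"))  # True
print(in_scope("https://example.com/about-us"))          # False
```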
⚠️ Important: Make sure your website does not block bots (check your robots.txt). If you shut the door, it won’t be able to learn!
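You can check this yourself with Python’s built-in `urllib.robotparser`. The user-agent string below is a placeholder; use the one your crawler actually identifies as (check the product docs):

```python
# Quick robots.txt check using only the standard library.
# "MyCrawlerBot" is a placeholder, not the real crawler's name.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # your site here
rp.read()

# True = robots.txt lets that agent fetch the page.
print(rp.can_fetch("MyCrawlerBot", "https://example.com/blog/"))
```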
📊 Status Traffic Lights (What’s happening?)
In the main list (first image), you’ll see small coloured dots. Here’s what they mean:
- 🟡 Pending: In the queue. Waiting for its turn.
- 🔵 Running: Working. The bot is crawling the website right now.
- 🟢 Scraped: Success! ✅ The content has been read, processed and is now in the agent’s brain.
- 🔴 Failed: Something went wrong. The website may be down, require a login or block crawlers.
- 🟠 Downloaded: Downloaded but not yet processed (it’s still “digesting” the information).
In practice, a healthy crawl usually moves 🟡 Pending → 🔵 Running → 🟠 Downloaded → 🟢 Scraped, and 🔴 Failed can appear at any stage.
🎓 Best Practices Summary (Cheat Sheet)
To keep a clean and useful knowledge library:
- Use Sitemaps: Whenever possible, use the Sitemap option. It’s much cleaner than letting the bot wander through random links.
- Avoid junk pages: You don’t need to index “Shopping Cart”, “Login” or “Legal Notice”. Configure exclusions if you use advanced options.
- Clear naming: When you have 10 crawlers, you’ll be glad you named them `FAQ_ES` and `FAQ_EN` instead of `web1` and `web2`.
- Check the update rate: If your website rarely changes, don’t run the crawler every hour. You’ll only annoy the servers.
🆘 Quick Troubleshooting
| Problem | Likely Fix 🔧 |
|---|---|
| Status “Failed” 🔴 | Your website may have an anti-bot firewall. Check whether you need to whitelist the crawler (its user agent or IP). |
| Reads too many pages | Switch to Sitemap or restrict the Path so it only reads what matters. |
| Information not updating | Check the “Update Frequency” slider. It might be set to “Monthly” when you need “Daily”. |
| The agent mixes data | Do you have two crawlers reading the same content? Remove duplicates. |
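For the “Failed” case, a quick first test is to request a page yourself with a bot-like user agent and look at the status code; a 403 usually points to an anti-bot rule. A rough sketch (standard library only; the user-agent string is a placeholder):

```python
# Rough "am I blocked?" probe: fetch a page with a non-browser
# User-Agent and inspect the HTTP status. Illustrative only.
import urllib.request
from urllib.error import HTTPError

req = urllib.request.Request(
    "https://example.com/",                     # your site here
    headers={"User-Agent": "MyCrawlerBot/1.0"},  # placeholder agent
)
try:
    with urllib.request.urlopen(req, timeout=10) as resp:
        print("Status:", resp.status)    # 200 = reachable for bots
except HTTPError as err:
    print("Blocked or error:", err.code)  # 403 often = anti-bot firewall
```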
