
Search Engine Spiders - How do web crawlers work?

Web crawlers (also known as web spiders or web robots, and in the FOAF community more often called web chasers) are programs that automatically crawl information from the World Wide Web according to certain rules or scripts. Other less commonly used names are ant, auto-indexer, emulator, and worm.

search engine spider

The indexing and ranking of web pages is inseparable from spiders. A spider is simply a crawling program that retrieves a website's content by following its URL addresses.
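To make that concrete, here is a minimal sketch of what such a crawling program does: fetch a page, extract its links, and follow them. It uses only the Python standard library; the ExampleSpider name and example.com URL are placeholders, not any real spider, and production crawlers add politeness delays, robots.txt checks, and deduplication.

# Minimal crawler sketch (illustrative only): fetch one page and
# return the absolute URLs of the links it contains.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import Request, urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(url, user_agent="ExampleSpider/1.0"):
    # Identify ourselves via User-Agent, just as Baiduspider or Googlebot do.
    req = Request(url, headers={"User-Agent": user_agent})
    html = urlopen(req, timeout=10).read().decode("utf-8", errors="replace")
    parser = LinkExtractor()
    parser.feed(html)
    # Resolve relative links against the page URL before queueing them.
    return [urljoin(url, link) for link in parser.links]

if __name__ == "__main__":
    for link in crawl("https://example.com/")[:10]:
        print(link)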

  • 1. Baidu Spider: Baiduspider
  • 2. Google Spider: Googlebot
  • 3. 360 Spider: 360Spider
  • 4. SOSO spider: Sosospider
  • 5. Yahoo Spider: "Yahoo! Slurp China" or "Yahoo! Slurp"
  • 6. Youdao Spider: YoudaoBot, YodaoBot
  • 7. Sogou Spider: Sogou News Spider, Sogou XXX spider, etc.
  • 8. MSN Spider: msnbot, msnbot-media
  • 9. Bing Spider: bingbot
  • 10. Yisou Spider: YisouSpider
  • 11. Alexa Spider: ia_archiver
  • 12. Easou Spider: EasouSpider
  • 13. Jike Spider: JikeSpider
  • 14. Etao Spider: EtaoSpider
  • 15. Toutiao Spider: Bytespider

Overseas spiders reportedly include YandexBot, AhrefsBot, and Ezooms.bot.

A web crawler case study

Website log traffic statistics for spiders - SEO
Let's compare the search engines as they appear in the logs:
Crawler comparison - Baidu performs best
Baidu ranks highest. However much market share other search engines or browsers claim, there is no denying the diligence of Baidu's spiders and servers, or the role Baidu plays as a search engine in China; the others cannot match it.
Access terminal - Windows comes first
The terminal used to access the website is still Windows: users visit from Windows PCs, and toB business still relies on the desktop computer.
Baiduspider - a Baidu spider crawling case
Baidu's spider identifies itself as Baiduspider in the User-Agent header, as shown above. If the User-Agent value is empty, or names a browser but no search engine, the visit is probably from a human user rather than a spider.
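As a rough illustration, here is a sketch of classifying log entries by their User-Agent string. The token list is drawn from the spider names above and is illustrative, not exhaustive; note that User-Agent strings can be forged, so serious log analysis should also verify spiders by reverse DNS lookup.

# Classify an access-log User-Agent string as a known spider or a likely human.
SPIDER_TOKENS = {
    "Baiduspider": "Baidu",
    "Googlebot": "Google",
    "360Spider": "360",
    "Sosospider": "SOSO",
    "bingbot": "Bing",
    "YandexBot": "Yandex",
    "Bytespider": "Toutiao",
}

def classify(user_agent: str) -> str:
    if not user_agent or user_agent == "None":
        return "unknown (empty User-Agent)"
    for token, engine in SPIDER_TOKENS.items():
        if token.lower() in user_agent.lower():
            return f"spider ({engine})"
    return "likely human visitor"

print(classify("Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"))
print(classify("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"))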

The robots protocol (robots.txt)

robots.txt is the first file a spider fetches when it visits a website, and it can control which parts of the site the spider crawls. Of course, some spiders do not obey these rules: if you deny them access they can still crawl, because the protocol is a convention, not law; the law still lags behind on the technical level.
User-Agent: *
Disallow: /
Here * is a wildcard matching all types of search engines, and Disallow: / blocks the whole site. Other common rules:
  • Disallow: /admin/ — forbid crawling anything under the admin directory.
  • Disallow: /require/ — forbid crawling anything under the require directory.
  • Disallow: /ABC/ — forbid crawling anything under the ABC directory.
  • Disallow: /cgi-bin/*.htm — forbid access to all URLs (including subdirectories) under /cgi-bin/ that end in ".htm".
  • Disallow: /*?* — forbid access to all URLs containing a question mark (?); a single file, such as an adc.html file, can likewise be blocked by listing its full path.
  • Allow: /cgi-bin/ — allow crawling the directories under cgi-bin.
  • Allow: /tmp — allow crawling the entire tmp directory.
  • Allow: .htm$ — only allow access to URLs ending in ".htm".
  • Allow: .gif$ — allow crawling of web pages and .gif images.
  • Sitemap: — tells the crawler where the site's sitemap is.
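You can also check programmatically whether a given URL may be fetched under a site's rules. Here is a minimal sketch using Python's standard urllib.robotparser; the example.com URLs are placeholders.

# Check whether a spider may fetch a URL under a site's robots.txt rules.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("https://example.com/robots.txt")  # placeholder site
rp.read()  # fetch and parse the robots.txt file

# Obedient crawlers call this before every request; rogue ones simply don't.
print(rp.can_fetch("Baiduspider", "https://example.com/admin/users.html"))
print(rp.can_fetch("*", "https://example.com/index.html"))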
Further reading on how to use robots.txt:
How to use robots
A Beginner's Guide to SEO in Five Minutes
