The Importance of Robots.txt

When included in the design of a website, the robots.txt file provides a set of instructions for search engine web crawlers. By implementing the robots.txt file, a web designer can direct each search engine to the most relevant information about the site, and steer it away from unimportant or proprietary data that would not be useful in search engine results.

Web Crawlers

In order for search engines to provide instant, relevant results to a search query, they must first gather that information from across countless websites. To accomplish this massive task, each search engine uses a web crawler program. The web crawler accesses a website, extracts information from the initial page, and then follows all the links on that page to read and interpret every other page on the site. By stepping through all the links on a site, it eventually gathers and processes all the relevant information that a web search might use. This process is repeated across the entire Internet on a regular basis to keep up with changes and additions.
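
As a rough illustration of that crawl-and-follow-links loop, the minimal Python sketch below fetches a page, collects its links, and queues them for further visits. The starting URL, the page limit, and the absence of any politeness rules are simplifying assumptions for illustration only; real crawlers are far more sophisticated.

# Minimal sketch of the crawl-and-follow-links loop described above,
# using only the Python standard library. The start URL and page limit
# are illustrative assumptions, not part of any real crawler.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(start_url, max_pages=10):
    """Visit pages breadth-first, following the links found on each page."""
    seen, queue = set(), [start_url]
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except OSError:
            continue                      # skip pages that fail to load
        parser = LinkCollector()
        parser.feed(html)
        # Resolve relative links against the current page before queueing.
        queue.extend(urljoin(url, link) for link in parser.links)
    return seen


if __name__ == "__main__":
    print(crawl("https://example.com"))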

User-agents and URL pages

Since some pages on a website are more relevant than others for search engines, the site designer can add a file called robots.txt that gives instructions to the web crawler, which is also known as a user-agent. The robots.txt file contains lines that can address all user-agents at once, or it can specify certain ones. The instructions in the robots.txt file then direct the user-agent to ignore specific URL directories or individual webpages. The file can also direct the user-agent to wait a specified amount of time between requests before continuing to crawl.
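
For a sense of how a crawler can act on these instructions, Python's standard urllib.robotparser module reads a site's robots.txt file and answers whether a given user-agent may fetch a given URL. The site address and the "ExampleBot" user-agent name below are placeholders, not values taken from this article.

# Brief sketch of how a well-behaved crawler might consult robots.txt,
# using Python's standard urllib.robotparser. The domain and the
# "ExampleBot" user-agent name are illustrative placeholders.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")
rp.read()                                   # fetch and parse the file

# can_fetch() applies the file's rules for the given user-agent to a URL.
print(rp.can_fetch("ExampleBot", "https://example.com/private/page.htm"))
print(rp.can_fetch("*", "https://example.com/index.htm"))

# crawl_delay() returns the Crawl-delay value for that user-agent, if any.
print(rp.crawl_delay("ExampleBot"))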

Format and Syntax

Each segment of the robots.txt file contains lines that are read by the user-agents for specific directions. A segment starts with a line that names one user-agent, or signifies that it addresses them all:

User-agent: *

Then, if a delay is desired, the next line tells the user-agent the number of seconds to wait between successive requests:

Crawl-delay: 100

This line is followed by the specific folders or files to disallow:

Disallow: /subfolder1
Disallow: /subfolder2/page1.htm
Disallow: /mypage.htm

The user-agent line can use the wildcard * symbol to address all search engines, or it can name specific ones. Each user-agent can have its own set of commands in the robots.txt file, separated by a blank line. The Disallow command can also use * as a wildcard and $ to indicate all files with the specified ending.
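
For instance, a robots.txt file that gives one set of rules to a single crawler and a different set to all others might look like the following; the user-agent name and paths are purely illustrative:

User-agent: ExampleBot
Crawl-delay: 10
Disallow: /drafts/
Disallow: /*.pdf$

User-agent: *
Disallow: /private/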

The name of the robots.txt file is all lowercase, and it must be placed in the website root directory. To view a website's robots.txt file, if it exists, open a browser, and for the URL enter:

sitename.com/robots.txt

You can view the robots.txt file for any site on the Internet using this syntax.
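
The same file can also be retrieved programmatically; a brief Python sketch, with a placeholder domain, might look like this:

# Fetch and print the raw robots.txt for a site (placeholder domain).
from urllib.request import urlopen

with urlopen("https://example.com/robots.txt") as response:
    print(response.read().decode("utf-8", errors="replace"))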

Examples

User-agent: *
Disallow: /search/
Disallow: /trucks/ranger/index.amp/
Disallow: /trucks/ranger/index.amp
Disallow: /articles/
Disallow: /akamai/
Disallow: /synbase/bev*
Disallow: /synbase/bev/*


User-agent: *
Disallow: /cart
Disallow: /checkout
Disallow: /my-account
Disallow: /cgi-bin
Disallow: /search
Disallow: /login
Disallow: /userHeaderInfo?store=*
Disallow: /locations/store?storeID=*
Disallow: /miniCart/SUBTOTAL
Disallow: /c/getDayPartMessage
Disallow: /taco-gifter/api

By utilizing the robots.txt file, a web programmer can increase the visibility of pertinent data to the various search engines, improving search placement and the relevance of results.
