Robots.txt: A Beginner's Guide

Robots.txt is:

A simple text file that tells search engine bots which pages on a website must not be crawled (or, in some cases, which pages must be crawled). The file must be placed in the root directory of your site. The standard it follows was developed in 1994 and is known as the Robots Exclusion Standard or Robots Exclusion Protocol.
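For example, assuming your domain is yoursite.com (a placeholder used throughout this guide), crawlers will look for the file at:

http://yoursite.com/robots.txt

A robots.txt file placed anywhere else, such as inside a subdirectory, will simply be ignored by crawlers.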

Some common misconceptions about robots.txt:

  • It stops content from being indexed and shown in search results.

If you list a page or file in your robots.txt file but its URL is linked from external sources, search engine bots may still discover and index that URL and show the page in search results. Also, not all robots follow the instructions given in robots.txt, so some bots may crawl and index pages listed in the file anyway. If you want an extra indexing block, a robots meta tag with a 'noindex' value in the content attribute will serve as such when added to those specific pages, as shown below:

<meta name="robots" content="noindex">


  • It protects private content.

If you have private or confidential content on a site that you want to keep away from bots, do not rely on robots.txt alone. It is advisable to password-protect such files, or not to publish them online at all.

  • It guarantees no duplicate content indexing.

As robots.txt does not guarantee that a page will not be indexed, it is unsafe to rely on it to block duplicate content on your site. If you do use robots.txt for this purpose, make sure you also adopt a more reliable method, such as a rel="canonical" tag, as shown in the sketch below.
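As a minimal sketch (the URL is a placeholder), a duplicate page can point search engines to the preferred version by including a canonical link element in its <head> section:

<link rel="canonical" href="http://yoursite.com/preferred-page.html">

Unlike a robots.txt block, this approach lets crawlers read the duplicate page and consolidate its signals onto the canonical URL.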

  • It guarantees the blocking of all robots.

Unlike Googlebot, not all bots are legitimate, and illegitimate bots may simply ignore the instructions in your robots.txt file. The only way to block these unwanted or malicious bots is to deny them access to your web server through server configuration or a network firewall, assuming the bot operates from a single IP address.


Uses for Robots.txt:

In some cases the use of robots.txt may seem ineffective, as pointed out in the section above. The file exists for a reason, however: it plays an important role in on-page SEO.

The following are some of the practical ways to use robots.txt:

  • To discourage crawlers from visiting private folders.
  • To keep the robots from crawling less noteworthy content on a website. This gives them more time to crawl the important content that is intended to be shown in search results.
  • To allow only specific bots access to crawl your site. This saves bandwidth.
  • Search bots request the robots.txt file by default. If they do not find one, they log a 404 error, which you will see in your log files. To avoid these errors, provide at least a default robots.txt, i.e. a blank robots.txt file.
  • To provide bots with the location of your Sitemap.  To do this, enter a directive in your robots.txt that includes the location of your Sitemap:
      Sitemap: http://yoursite.com/sitemap-location.xml

You can add this directive anywhere in the robots.txt file because it is independent of the user-agent line. All you have to do is replace the sitemap-location.xml part of the URL with the actual location of your Sitemap. Learn more about sitemaps in our blog on XML Sitemaps. If you have multiple Sitemaps, you can list each one or point to a Sitemap index file, as sketched below.
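As a minimal sketch with placeholder file names, you can either list each Sitemap on its own line:

Sitemap: http://yoursite.com/sitemap-pages.xml
Sitemap: http://yoursite.com/sitemap-posts.xml

or point to a single Sitemap index file that references the others:

Sitemap: http://yoursite.com/sitemap-index.xml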

Examples of Robots.txt Files:

There are two major elements in a robots.txt file: User-agent and Disallow.

User-agent: The user-agent line is most often set to a wildcard, the asterisk (*), which signifies that the instructions apply to all bots. If you want a particular bot to be blocked from, or allowed on, certain pages, specify that bot's name in the user-agent line.

Disallow: When Disallow has nothing specified, bots can crawl all the pages on the site. To block content, use only one URL prefix per Disallow line; you cannot include multiple folders or URL prefixes in a single Disallow directive.

The following are some common uses of robots.txt files.

To allow all bots to access the whole site (the default robots.txt) the following is used:

User-agent: *
Disallow:

To block the entire server from the bots, this robots.txt is used:

User-agent: *
Disallow: /

To allow a single robot and disallow other robots:

User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /

To block the site from a single robot:

User-agent: XYZbot
Disallow: /

To block some parts of the site:

User-agent: *
Disallow: /tmp/
Disallow: /junk/

Use this robots.txt to block all content of a specific file type. In this example we are excluding all PowerPoint files. (NOTE: The dollar sign ($) matches the end of the URL):

User-agent: *
Disallow: *.ppt$

To block bots from a specific file:

User-agent: *
Disallow: /directory/file.html

To let bots crawl certain HTML documents inside a directory that is otherwise blocked, you can use an Allow directive. Some major crawlers support the Allow directive in robots.txt. An example is shown below:

User-agent: *
Disallow: /folder1/
Allow: /folder1/myfile.html

To block URLs containing specific query strings that may result in duplicate content, the robots.txt below is used. In this case, any URL containing a question mark (?) is blocked:

User-agent: *
Disallow: /*?

For the page not to be indexed: Sometimes a page will get indexed even if you include it in the robots.txt file, for example because it is linked externally. To completely block that page from being shown in search results, include a robots noindex meta tag on each of those pages individually, as shown below. You can also add a nofollow value to instruct bots not to follow the page's outbound links:

<meta name="robots" content="noindex">

For the page not to be indexed and its links not to be followed:

<meta name="robots" content="noindex,nofollow">

NOTE: If you add these pages to robots.txt and also add the above meta tag to them, the pages will not be crawled, so the bots never get to read the meta tags; the pages may still appear in the URL-only listings of search results. For the noindex tag to work, the page must not be blocked in robots.txt.

Another important thing to note is that you must not include any URL blocked by your robots.txt file in your XML sitemap. This can happen, especially when you use separate tools to generate the robots.txt file and the XML sitemap; in such cases you may have to check manually whether any blocked URLs ended up in the sitemap. You can test this in your Google Webmaster Tools account, provided your site is submitted and verified there and you have submitted your sitemap.

Go to Webmaster Tools > Optimization > Sitemaps; if the tool shows a crawl error for any submitted sitemap, double-check whether the affected URL is one that is blocked in robots.txt.


Google Project Fi, Google’s New Cellphone / Wireless Service, Has Launched

Google has now launched its anticipated cellphone service, called Project Fi. Details are up on the Google Blog, as well as on a new Fi website where people who own a Nexus 6 phone can request an invite.

A new way to say hello

Project Fi is a program to deliver a fast, easy wireless experience in close partnership with leading carriers, hardware makers, and Google users.

Google has also released a video about the service:

In today’s mobile world, fast and reliable connectivity is almost second nature. But even in places like the U.S., where mobile connections are nearly ubiquitous, there are still times when you turn to your phone for that split-second answer and don’t have fast enough speed. Or you can’t get calls and texts because you left your phone in a taxi (or it got lost in a couch cushion for the day). As mobile devices continually improve how you connect to people and information, it’s important that wireless connectivity and communication keep pace and be fast everywhere, easy to use, and accessible to everyone.

That’s why today we’re introducing Project Fi, a program to explore this opportunity by introducing new ideas through a fast and easy wireless experience. Similar to our Nexus hardware program, Project Fi enables us to work in close partnership with leading carriers, hardware makers, and all of you to push the boundaries of what’s possible. By designing across hardware, software and connectivity, we can more fully explore new ways for people to connect and communicate. Two of the top mobile networks in the U.S.—Sprint and T-Mobile—are partnering with us to launch Project Fi and now you can be part of the project too.

For more details, please visit: Google Project Fi

Project Fi takes a fresh approach to plans and pricing to make decisions simple and help you save.

