Today we take a deep dive into the technical aspects of robots.txt: Google’s crawlers, what you can do as the webmaster to control who crawls your site, and how to manage the creepy crawlers of the world wide web!
Voices of Search arms SEOs with the latest news and insights they need to navigate the ever-changing landscape of Search Engine Optimization and Content Marketing. From the heart of Silicon Valley, Searchmetrics’ CEO Jordan Koene delivers actionable insights into using data to navigate the topsy-turvy world now being created by Google, Apple and other search giants.
Ben: Welcome back to the Voices of Search podcast. I’m your host, Benjamin Shapiro, and in this podcast, we’re going to discuss the hottest topics in the ever-changing world of search engine optimization. This podcast is brought to you by the marketing team at Searchmetrics. We’re an SEO and content marketing platform that helps enterprise-scale businesses monitor their online presence and make data-driven decisions. If you’re looking to understand how you can optimize your content, understand what topics you need to cover, or how to ensure that your writers produce effective posts, go to Searchmetrics.com for a free tour of our platform.
Ben: Joining us today is Jordan Koene who is both a world-renowned SEO strategist and the CEO of Searchmetrics. Today we’re going to chat about how to monitor and deal with technical issues related to your robots.txt file. Fun stuff. Jordan welcome back to the Voices of Search Podcast.
Jordan: Thanks Ben. Looking forward to the topic today.
Ben: Yeah, me too. So, I need your help. I got an alert from Google Search Console this week, for one of my consulting websites, saying that I had a handful of pages that were indexed though blocked by robots.txt. Which made me think about what the process is for identifying and fixing technical issues for a site. So I guess I have a few questions for you; let’s go through them one by one. First, what does “indexed though blocked by robots.txt” mean? Two, what do I do to fix it? And three, what’s the right process for companies larger than mine for managing their technical issues and making sure that they’re resolved quickly?
Ben: So, can we start off by … Can you tell me what’s my problem?
Jordan: Well, absolutely.
Ben: I know, I know, where should I start?
Jordan: So, yeah, I can certainly tell you what the problem is. In some cases this could be a problem; in other cases, maybe it’s not. But ultimately, a lot of folks did receive notifications recently from Search Console, in particular, notifying them of index coverage issues. Index coverage is really one of the status elements that Google is constantly monitoring on your website, and what it technically means, for our listeners, is: “Am I serving up a page that can be indexed within Google? Can this page, in its current form, in its structure, in its substance, be indexed by Google?”
Jordan: And there’s a lot of different criteria that Google looks at in order for something to be indexed. And so, hence, the index coverage is looking at: “Well, out of all your pages, are there certain issues, restrictions, warnings, or exclusions that are preventing this page from being indexed?”
Ben: Okay. So, what I noticed when I went into Google Search Console was that I had 31 issues, and all of them related to pages that were tagged. So, BenJShap.com, question mark, tag, equals, some sort of term that I’ve used to tag my content. Is there a way to unblock my robots.txt file? And maybe, just for the people that are as inexperienced as I am, what exactly is a robots.txt file, and how do you fix it?
Jordan: So, the situation that you’re encountering, when it comes to index coverage, is that you are specifically notifying Google that you would like these particular pages blocked, or more likely this particular directory of pages blocked, by using what’s called your robots.txt file. And what your robots.txt file is, is a file on your site that you put up to notify Google of various requirements on your site.
Jordan: And so, you can do what is called a “disallow” which says, “I do not want Google, to be crawling, or accessing this particular content.” And more likely than not, you are doing this … You are disallowing these tags, which is probably a really good practice, because you typically don’t want Google crawling tags and monitoring the tags on your blog. Those are typically low quality type pages, because they’re like a collection of dates, or a collection of very unspecified categories, or topics. And so, for that reason, you typically avoid having Google crawl and index those particular pages.
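To make that concrete, a minimal robots.txt that disallows blog tag pages in the way Jordan describes might look like this (the exact rule pattern is an illustrative assumption, not taken from Ben’s actual file):

```text
# Hypothetical robots.txt; the ?tag= pattern is an illustrative assumption
User-agent: *
Disallow: /?tag=
```

A plain Disallow rule like this matches by prefix, so it covers /?tag=seo, /?tag=marketing, and so on.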
Ben: Okay, so it sounds like I don’t have anything to worry about on my side, specifically, the pages that Google would be crawling, let’s say the search engine optimization tag, I’d rather have Google crawl the actual pages that are related to search engine optimization as opposed to the aggregate page, where they’re collected and tagged.
Jordan: Correct. Now, just for everyone to know, a robots.txt file is something that is owned and controlled by you as a webmaster, so you have the full right and authority to say, “Yes, go and access this,” or, “No, don’t access this.” And there’s a bunch of other criteria: there’s specifying the user agent, so for example, what type of crawlers can come to my site, do I allow Google? For big brands, sometimes they block competitors from being able to come to their site. And so, there’s a bunch of different rules and rule sets in here, but the important piece that I think everyone needs to understand is that this is something that you can control, and you have the full authority to control it.
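As an illustrative sketch of how these per-crawler rules behave, Python’s standard-library `urllib.robotparser` can evaluate a rule set. The crawler name “BadBot” and all the rules below are made-up examples, not from the episode:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: one made-up crawler is blocked entirely,
# everyone else is only blocked from tag pages.
rules = """
User-agent: BadBot
Disallow: /

User-agent: *
Disallow: /?tag=
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Googlebot may fetch articles but not tag pages; BadBot gets nothing.
googlebot_article = parser.can_fetch("Googlebot", "https://example.com/seo-basics")
googlebot_tag = parser.can_fetch("Googlebot", "https://example.com/?tag=seo")
badbot_home = parser.can_fetch("BadBot", "https://example.com/")
```

Worth remembering: robots.txt is advisory. Well-behaved crawlers honor it, but nothing technically enforces it, so it is not an access-control mechanism.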
Ben: So, let’s talk about how somebody can do that. Do you have the right to go decide who’s crawling your site? And you can figure out what’s getting indexed and what isn’t? And how do you go and manipulate the robots.txt file?
Jordan: Taking one quick step back, the idea here, is around how we maintain and manage index, and coverage, and crawl, on our website. And, the funny thing is, and I think this is where things get a little tricky for our listeners, is that there’s a variety of different ways to do this. And, right now we’re talking about one of them, which is using your robots.txt file, and to answer your question specifically, you can just go in and manage this particular page on your website.
Jordan: It is like any other file on your website, and for the webmasters, it’s just a file that’s saved on your database, and is in your repository or in your CMS, and you can go in, you can manipulate, you can update the file, and then publish it. And Google’s going to constantly check that particular page. Now, there are various ways to manage this thing, so if you have a Word Press website, Word Press has plugins or different features where you can update or change the robots.txt file.
Jordan: It’s a very primitive page, it doesn’t have much in it, it’s a white page, with a bunch of instructions. That’s really all it is. But, the point that I’m trying to make here in a sense, it’s one of many tactics or solutions that you can use to maintain and control crawlers in particular, Google’s crawler.
Ben: Okay. So, it sounds like there’s a little bit of Googling to try to figure out exactly where your robots.txt file is, depending on what platform you’re using. But really, all you’re doing is amending the text file and saving it. The next time Google or any other crawler gets that page, they reread the rules, and that happens relatively quickly.
Jordan: Yeah, all of these files live in the same directory, the same place. So if you go to, say, Google.com slash robots.txt, that exact URL, you’re going to see all the allows, disallows, and the various other commands that Google has within their robots.txt. And all websites typically have this file. If you don’t have it, you should add it.
Ben: Okay, so you can find your robots.txt file, by going to Google dot com slash robots.txt? Is that the link?
Jordan: Yes, so website name, and then slash robots.txt.
Ben: Okay, website name slash robots.txt. And then you can find that file in your directory and amend it however you see fit. You mentioned that not posting tags is a best practice; are there any other best practices related to the robots.txt file that everyone should be aware of? Obviously, disallowing crawlers from your shopping cart and things like that seems like table stakes.
Jordan: Yeah, there’s a variety of different practices, and different websites and webmasters use different strategies. I think for our listeners here, it’s really important, especially for SEOs, to be thinking about: what is the priority and structure by which I want to prevent Google from crawling certain content?
Jordan: Because by and large, this is a tactic to maintain a high level of integrity with Google: how do I ensure that Google is only accessing the highest quality, best content on my website? And these tools allow you to control and move those levers. So, to get to the point here: you can block certain directories, or specific pages if you really wanted to, and a very common practice is blocking URL parameters. For those of you who know, a URL parameter is often appended to a URL so that you can show a particular refinement, or a particular sub-experience within the page. You often want to block those, because it’s not a new page, it’s just the color red for a particular pair of shoes.
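As an illustration of that parameter-blocking pattern: Google’s robots.txt syntax supports the * wildcard, so refinement parameters can be blocked with rules like these (the parameter names here are hypothetical, not from the episode):

```text
User-agent: Googlebot
# Block faceted refinements such as color or size (illustrative parameters)
Disallow: /*?color=
Disallow: /*&size=
```

Note that the * wildcard is a Google extension described in its robots.txt documentation; the original robots exclusion convention, and some other parsers, only do simple prefix matching.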
Jordan: And so, there are a collection of different experiences and elements that can change within your URL structure, that you may want to block and prevent Google from crawling.
Ben: Okay. So, let me ask you a similar question in a different way. It sounds like it’s a very powerful tool where you can select what type of content Google, or other search engines are crawling. What are the best resources that SEOs can look for, to understand how they can best manipulate their robots.txt files?
Jordan: Yeah, so Google has a great help center, support dot Google dot com slash webmasters; this is your best place to start. They have a full section dedicated to managing robots.txt, and really a whole section around how to control and maintain the crawling and indexing of your website. Google also has, in this very same section, a set of webmaster guidelines, and these can also help you make these decisions and give you some guidance on how to use this.
Jordan: Ultimately, I think one of the most important things, especially for the folks listening who are working on enterprise-level SEO, a really critical component here, is working with your development and engineering teams to create a specific set of rules or criteria on how you’re going to manage pages with Google. Because robots.txt is kind of like a blunt instrument: it is very direct and very heavy-handed in the way that it imposes requirements on what Google can and can’t do.
Jordan: There are other ways that are maybe a little less aggressive, and I often talk to SEOs and webmasters alike about this particular topic, which is: how aggressive do you need to be here? Is robots.txt the best way to disallow or block content from Google?
Ben: Okay, so let’s actually take a step back, now that we’re talking about enterprise companies. First off, how can they evaluate when they have a technical issue, whether they resolve it with a robots.txt file or another way? What are the tools that you use to understand when you’re having a technical issue?
Jordan: Yeah, so the first place is obviously Google Search Console; this is the free tool provided by Google to help you monitor and check your website. But then there’s also a variety of different tools out there. Most notably, for in-house SEOs or in-house development teams, there are tools like Splunk and other crawl monitoring tools that help you understand where Google is going, where Google is crawling. There are also enterprise SEO tools like ourselves and others, where we can go in and crawl and help you determine where particular issues are occurring; so not just a monitoring piece as before, but actually telling you, “Hey, you have a problem here, you have a problem there, you should go fix these things.” And oftentimes, especially with our tool, it’s something where you’re consistently monitoring, so you’re able to identify or find these particular issues prior to, say, Google implementing a change or enforcing that particular change on your site.
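As a minimal sketch of the kind of crawl monitoring such tools automate, here is a toy pass over access logs that counts which paths Googlebot requests. The log lines, IPs, and paths below are fabricated for illustration, and real logs would be read from disk, not a list:

```python
import re
from collections import Counter

# Fabricated access-log lines in the common combined format.
LOG = [
    '66.249.66.1 - - [10/Oct/2019:13:55:36 +0000] "GET /seo-basics HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '66.249.66.1 - - [10/Oct/2019:13:55:40 +0000] "GET /?tag=seo HTTP/1.1" 200 128 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '203.0.113.9 - - [10/Oct/2019:13:56:02 +0000] "GET /seo-basics HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

# Pull the requested path out of each GET line.
pattern = re.compile(r'"GET (\S+) HTTP')

# Count paths only for lines whose user agent claims to be Googlebot.
hits = Counter(
    pattern.search(line).group(1)
    for line in LOG
    if "Googlebot" in line
)
```

A real pipeline should also verify that the traffic genuinely comes from Google (for example via reverse DNS), since the user-agent string is trivially spoofed.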
Jordan: So, we have one tool called Visibility Guard, which actually tracks your site multiple times per day, to identify particular crawl issues that you might have.
Ben: Let’s dig a little deeper there, for the people who are Searchmetrics customers that are listening to this podcast or people that are interested in our services, you mentioned the Visibility Guard? Or if you’re using another third party tool, to highlight that you have a technical issue, what’s the process for figuring out what the technical issue is, and what you can do with it?
Jordan: So, the first thing here is just setting up some good practices and processes around monitoring, because many of these issues don’t just present themselves in a way that’s easy for you to see, say, in your analytics data. It’s something that you have to be constantly monitoring and keeping track of. Which is why Google Search Console sent you this notification earlier this week: Google has been updating its messaging and its monitoring, and so a lot of folks received these notifications. And you can actually control that; you can go into your settings and control when you get notifications or updates. But the principle here is the right one, which is, you need to set up a cadence and a process to monitor this on a relatively frequent basis.
Jordan: For smaller websites, say under 10 thousand pages, this is maybe something that you’re looking at once a week, once a month. For big enterprise sites, this is something that you’re looking at every day, and if you have even a larger organization where you still have a dedicated developed dev ops team, or someone who’s monitoring the status of your website, you may even want to set up particular alerts and requirements with that development team, so that you as the SEO are getting notifications directly from your dev ops team.
Jordan: And I think that’s a really unique opportunity that many SEOs never even think about, because this is a team that sits in some dark corner in engineering, but really working with those folks, always pays dividends because they’re seeing the problem first hand, they’re the first group notified when there’s a particular crawl issue or access issue with your website.
Ben: Okay. So, essentially, if you’re large enough to have a dev ops team, you should buddy up with them and make sure that you keep them on your Christmas card list (we talked about getting ready for the holidays; maybe we should add that to last week’s episode). When you’re a smaller site, it sounds like you’re checking in with a regular cadence on your Google Search Console, setting up alerts or monitoring when Google Search Console sends you alerts, and making sure that you’re as reactive as you can be.
Jordan: That’s correct, yup.
Ben: Is there any other way, other than being reactive, is there … Are there proactive steps that you can take?
Jordan: That’s a great question, and it’s a tricky one, but yes, there are ways for businesses to be proactive in this space. One of them is creating a strong cadence between how your sitemaps are built and what it is that you’re controlling and restricting within robots.txt. Another tactic that’s often used is an [inaudible 00:17:31] directive, which sets a noindex, nofollow or noindex, follow type status. But using your sitemaps as a benchmark of what pages you want to have indexed often then informs and directs your strategy on what pages you don’t want to have indexed.
Jordan: And so, that ultimately is one of the proactive things. I think a lot of webmasters and a lot of SEOs forget about sitemaps; they say, “Oh, I did my sitemaps, they’re all set up, they’re all sitting there,” but we don’t go back and do a thorough audit or review, and then reflect on: what are the priorities of content that I have on my site, and how do I use that to have checks and balances with what I want to block or restrict from Google?
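One way to sketch that check-and-balance between sitemaps and robots.txt, using Python’s standard library: flag any sitemap URL that robots.txt blocks, since telling Google “please index this” and “don’t crawl this” at once is the contradiction behind warnings like “indexed though blocked by robots.txt.” The sitemap, rules, and URLs below are made-up examples, not from the episode:

```python
import xml.etree.ElementTree as ET
from urllib.robotparser import RobotFileParser

# A tiny, fabricated sitemap with one ordinary page and one tag page.
SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/seo-basics</loc></url>
  <url><loc>https://example.com/?tag=seo</loc></url>
</urlset>"""

# Hypothetical robots.txt rules blocking tag pages.
ROBOTS = ["User-agent: *", "Disallow: /?tag="]

parser = RobotFileParser()
parser.parse(ROBOTS)

# Extract every <loc> from the sitemap.
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
urls = [loc.text for loc in ET.fromstring(SITEMAP).findall(".//sm:loc", ns)]

# A URL that the sitemap submits but robots.txt blocks is worth fixing:
conflicts = [u for u in urls if not parser.can_fetch("Googlebot", u)]
```

The fix for each conflict is a judgment call, as discussed above: either drop the URL from the sitemap or lift the robots.txt restriction, depending on whether the page deserves to be indexed.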
Jordan: Simultaneously, the other thing that you can always do that is proactive is creating a lot of these processes and expectations with other teams, like dev ops or engineering, or using a tool like ours, the Searchmetrics Visibility Guard, that proactively monitors. Because at the end of the day, this is more an exercise about monitoring than it is about triage, about being able to fix something after the fact rather than preventing it from happening.
Jordan: At the end of the day, it’s always about having good, clean code and good, clean structure and hierarchy to your website. But those are much bigger themes; you can’t just make a bold statement that you need to be proactive there. Those have their own set of strategies and requirements based on the business and how the business is set up online.
Ben: What I’m hearing is that there is infrastructure that, when you set it up in a logical way, can help you avoid a lot of the errors that can come up down the road. If you have a logical structure, and what you’re submitting to Google through your sitemap helps you avoid errors, you’re less likely to have to be reactive later. The other way to be proactive is setting up and using monitoring tools, so that when there is an issue, it is highlighted quickly and you can address it. And then the third thing is, like I said, make sure that you’re buddying up with your dev ops team, or whatever your engineering resources are, and set the expectation that there are going to be times when you need the website to be updated and fixed to keep your search visibility as high as possible.
Jordan: Yup, yup.
Ben: So, those are basically the three ways to stay on top of dealing with any website errors that may come up: one, make sure that you have a logical site structure (and we’re going to get into site structure in a future episode); two, make sure that you’re monitoring the performance of your site; and three, make sure that you have the resources ready to address any technical issues down the road.
Ben: Okay. Any other advice for the SEOs listening, in terms of when you run into technical issues, and how you can deal with them?
Jordan: Yeah, so there’s the whole practice around assessing your content and ensuring that that content is in a state that actually creates the highest value. One of the things that we do a lot of is help SEOs understand whether they have little or no original content, and whether or not that’s something that should or shouldn’t be indexed. And some of these things are very hard decisions, right?
Jordan: So, for example, you can have websites that are repurposing reviews, or review content, from other sources. Is that something that you should independently index and submit to Google, even though it’s coming from another source, a third-party source, that is also probably showing this to Google? Those are the kinds of hard decisions that a lot of webmasters and SEOs are facing, which is: do I actually want to put this content in Google? It can be very useful from a supplementary standpoint in helping other content, but independently, is it something that we should avoid?
Jordan: And so, that’s, I think, one of the principal areas where I highly recommend SEOs think long and hard about what the value is, and then be true advocates of only providing the highest quality, most useful content to Google. There are a lot of examples of this: there’s automated content that’s often on your site and has little to no value; sometimes you have hidden content, or cloaked content, on your site, or scraped content that’s being provided or surfaced on your site. There are various examples, and different businesses and different categories face different challenges, but ensuring that what is being indexed is your high quality, useful content is the ultimate benefit behind using these various tactics to block or disallow content from Google.
Ben: Okay. I think that’s great feedback. It sounds like making sure that you’re feeding Google the best content that you have, in a logical structure, is one way to avoid errors; and being able to evaluate your site, and being ready to move quickly when you do run into them, seems to be the theme of dealing with technical issues.
Ben: Okay, that wraps up this episode of the Voices of Search Podcast. Thanks for listening to my conversation with Jordan Koene, the CEO of Searchmetrics Inc. We’d love to continue the conversation with you, so if you’re interested in contacting Jordan, you can find links to his bio in our show notes, or you can shoot him an SEO-related tweet at J-T Koene, that’s J-T K-O-E-N-E on Twitter. If you have any general marketing questions, or if you want to talk about podcasting, you can find my contact information in our show notes, or you can send me a tweet at Ben J Shap, that’s B-E-N-J-S-H-A-P. If you’re interested in learning more about how to use search data to boost your organic traffic, online visibility, or to gain competitive insights, head over to Searchmetrics.com for a free tour of our platform. If you like this podcast, and you want a regular stream of SEO and content marketing insights in your feed, hit the subscribe button in your podcast app. Lastly, if you’ve enjoyed this show and you’re feeling generous, we would be honored for you to leave a review in the Apple iTunes Store; it’s a great way for us to share our learnings about SEO and content marketing.
Ben: Okay, that’s it for today, but until next time, remember the answers you’re looking for are always in the data.