This post was originally published on White.net
Removing pages from your website? Going through a site redesign or migration but not sure you have all the URLs on your website? It’s an issue that we will all go through at some point in our SEO life. I have been involved in a lot of projects recently where I’ve needed to find all the URLs that were on a website, and it can be a pain to do! However, I’ve now managed to get it down to 6 easy steps, and I wanted to share them with you.
Step 1. Crawl your website
This is an obvious one in my opinion. Looking at your website and gathering all the URLs that you can find should be easy, right? Well, maybe if you only have tens of pages, but if you are at enterprise level then this isn’t so easy.
Most of you will already know that you can use tools such as Xenu and ScreamingFrog, but if your site is at enterprise level, these may not be robust enough. In this case you could turn to DeepCrawl, which specialises in crawling large websites.
Once you have run the crawl, place those URLs into a spreadsheet on a tab labelled ‘Website Crawl’. You will also want to start a ‘Master List’ so that you have a single URL list, so go ahead and create that too. We will be constantly adding to this list as we go through the steps.
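If you are curious what a crawler is doing at each page, here is a minimal sketch in Python using only the standard library. It is not how Xenu, ScreamingFrog or DeepCrawl work internally; it only shows the core idea of pulling same-host links out of one page's HTML. The names `LinkCollector` and `internal_links` are made up for this example, and it does not fetch anything itself.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag in one HTML page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def internal_links(page_url, html):
    """Return the absolute, same-host URLs linked from one page, in order."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    found = []
    for href in collector.hrefs:
        # Resolve relative links against the page and drop any #fragment.
        absolute, _fragment = urldefrag(urljoin(page_url, href))
        # Keep links on the same host only, and skip repeats.
        if urlparse(absolute).netloc == host and absolute not in found:
            found.append(absolute)
    return found
```

A real crawl would repeat this for every URL it discovers, which is exactly the part the dedicated tools handle for you at scale.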
Step 2. Visit your Analytics Package
Reviewing your analytics data is extremely important for all levels of marketing, but many people don’t realise that this is a good place to see what pages you have on your website.
We are looking for all the pages on the website, so you need to head over to Content or Behaviour in your analytics package, change the date so you have a range of at least 18 months, and hit the download button. If you have too many URLs for the download or it is taking a considerable amount of time, you may want to investigate using the API.
Once you have these URLs, create a new tab called ‘Analytics Data’ and place your URLs there. You then want to add a copy to the bottom of the list in the ‘Master List’ tab. Now to step 3.
Step 3. XML Sitemaps
This is another place that is commonly forgotten when looking for URLs. The XML sitemap should, ideally, be the place to find the most up to date version of all URLs for your website. After all, it is the place you are asking the search engines to look to help improve your visibility.
Luckily, with an XML file you can open it straight into Excel, so do that now. You will need to do some formatting to remove the unnecessary tags that accompany the XML sitemap, but once you’ve done this it will leave you with a list of URLs. Copy this list into your original spreadsheet onto a worksheet called ‘Sitemaps’, and then copy it into the ‘Master List’ tab.
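If the Excel formatting gets fiddly, a few lines of Python can pull the URLs out instead. This is a minimal sketch, assuming your sitemap follows the standard sitemaps.org format where each URL sits in a `<loc>` tag; the function name `sitemap_urls` is my own.

```python
import xml.etree.ElementTree as ET

# Namespace used by standard XML sitemaps (sitemaps.org protocol 0.9).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text):
    """Return the text of every <loc> element in a sitemap, in file order."""
    root = ET.fromstring(xml_text)
    # Note: in a sitemap index file, <loc> holds the addresses of the
    # child sitemaps rather than page URLs, so those need parsing in turn.
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc") if loc.text]
```

You could read your saved sitemap file, pass its contents to `sitemap_urls`, and paste the result into the ‘Sitemaps’ worksheet.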
Step 4. Get a list of Your Most linked pages
Everyone likes a good link, right? So you wouldn’t want to do anything that may lose an influential link. The next step is to download a list of the URLs that are your most linked-to pages. You will probably need to do this from multiple tools if you have access to them, but at the very minimum you should download a list of URLs from Google Webmaster Tools (GWT).
Tools you can download your most linked-to pages from include:
- Google Webmaster Tools
- Bing Webmaster Tools
- Open Site Explorer Top Pages
Once you have these URLs, collate them and add them to a new tab called ‘Most Linked Pages’, and add a copy to the bottom of the URL list on the ‘Master List’ tab. Now move on to the fifth step.
Step 5. Scraping the SERPs
So far, we have used numerous tools to get all our URLs, but we haven’t checked the search engines! So let’s do this now. You will need a scraping extension for your browser. I use Scrape Similar for Chrome.
Go to your search engine of choice (you will want to check at least Google and Bing in the UK) and type in site:domain.com. Now change the settings of the results page to show 100 results, as we want to do this as quickly as possible! This can be done by going to your account settings and changing the view from 10 to 100. You may also need to untick Instant Search in the options.
Now you should be able to see 100 results from your domain. Hover over the title of the first result, right click and select “Scrape similar”. This should bring up a dialog box with the list of URLs from the first 100 results, along with the option to put it straight into Excel or Google Drive. Either option is good at this point. You will need to go through all the listings that the search engines have returned, which could take a bit of time! There might be a quicker way to do this, and if you know one I would be happy to hear about it in the comments below.
Once you have gone through the results and collated the URLs, put them in a new tab called ‘SERP Scraped URLs’ and add the list to the bottom of the URLs you have gathered from Steps 1-4 in the ‘Master List’ tab.
Step 6. De-dupe & Check
Wow, you have come a long way, and you more than likely have a lot of URLs within your spreadsheet. Most of those are likely to be duplicated; at least we hope they are, as it will mean you are doing a good job. In Excel there is a feature that removes all duplicates and leaves you with a unique list of URLs. This feature is found in Data > Remove Duplicates. Go ahead and run it on the ‘Master List’ tab.
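If your list has outgrown Excel, the same idea can be sketched in Python. This is only an illustration with a made-up function name, `dedupe_urls`: it trims stray whitespace, drops blank rows, and keeps the first occurrence of each exact URL. It makes no attempt to treat near-identical URLs (for example with and without a trailing slash) as the same page.

```python
def dedupe_urls(urls):
    """Return the URLs with blanks and exact repeats removed, order preserved."""
    cleaned = (url.strip() for url in urls)
    # dict.fromkeys keeps the first occurrence of each key, in insertion order.
    return list(dict.fromkeys(url for url in cleaned if url))
```

For example, you could join the lists from each tab with `+` and pass the combined list to `dedupe_urls` to get your master list.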
Hopefully this will leave you with a good number of URLs. Now for the final step: copy the list of URLs and run them through a crawler so that you can check the HTTP status of each one; I’d use ScreamingFrog. Once you have the status codes, copy this list back into your spreadsheet, which will leave you with as complete a list of URLs as possible, each with its status code. Now you are done!
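If you don’t have a crawler to hand, a status check can be roughly sketched in Python with the standard library. Treat this as an assumption-laden example rather than a replacement for a proper crawler: `http_status` is a name I have made up, it sends a HEAD request (which some servers reject), and because `urlopen` follows redirects you will see the status of the final page rather than the 301 or 302 itself.

```python
import urllib.error
import urllib.request


def http_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code for url, or None if no response arrived."""
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses are raised as exceptions; the code is on the error.
        return error.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout and similar.
        return None
```

The `opener` parameter is only there so the function can be tried out without touching the network; in normal use you would call `http_status(url)` for each URL in your master list and write the result next to it.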
If you have completed all six steps, then you should have a pretty thorough list of the URLs that are located on your website. I hope this was helpful and provides some structure to finding all the URLs that you need. Have I missed anything out? Is there a quicker, more reliable way of getting all the URLs? I would love to hear your thoughts in the comments below or over on Twitter @danielbianchini.