Updated by Richie Lauridsen & Allison Hahn on February 19, 2020. Originally published on May 11, 2015. So, I admit it: When we started looking at our own blog traffic, we realized this was one of the most historically popular blog posts on the Seer domain.
After a brief moment of reflection and a swell of enthusiasm for the ever-present greatness of the Screaming Frog SEO Spider, a tool that’s been a loyal companion in our technical SEO journey, we realized we were doing a disservice–both to our readers and to the many leaps forward from the great Screaming Frog.
Though this original guide was published in 2015, in the years since, Screaming Frog has evolved to offer a whole suite of new features and simplified steps to conduct technical audits, check a site’s health, or simply get a quick glimpse of info on a selection of URLs.
Below, you’ll find an updated guide to how SEOs, PPC professionals, and digital marketing experts can use the tool to streamline their workflow.
To get started, simply select what it is that you are looking to do. When starting a crawl, it’s a good idea to take a moment and evaluate what kind of information you’re looking to get, how big the site is, and how much of the site you’ll need to crawl in order to access it all.
Sometimes, with larger sites, it’s best to restrict the crawler to a sub-section of URLs to get a good representative sample of data.
This keeps file sizes and data exports a bit more manageable.
We go over this in further detail below. For crawling your entire site, including all subdomains, you’ll need to make some slight adjustments to the spider configuration to get started.
By default, Screaming Frog only crawls the subdomain that you enter. Any additional subdomains that the spider encounters will be viewed as external links.
- In order to crawl additional subdomains, you must change the settings in the Spider Configuration menu. By checking ‘Crawl All Subdomains’, you will ensure that the spider crawls any links that it encounters to other subdomains on your site.
- In addition, if you’re starting your crawl from a specific subfolder or subdirectory and still want Screaming Frog to crawl the whole site, check the box marked “Crawl Outside of Start Folder.”
- By default, the SEO Spider only crawls forward from the subfolder or subdirectory you start in. If you want to crawl the whole site and start from a specific subdirectory, be sure that the configuration is set to crawl outside the start folder.
To save time and disk space, be mindful of resources that you may not need in your crawl. Websites link to so much more than just pages.
- If you wish to limit your crawl to a single folder, simply enter the URL and press start without changing any of the default settings.
- If you’ve overwritten the original default settings, reset the default configuration within the ‘File’ menu. If you wish to start your crawl in a specific folder, but want to continue crawling to the rest of the subdomain, be sure to select ‘Crawl Outside Of Start Folder’ in the Spider Configuration settings before entering your specific starting URL.
- If you wish to limit your crawl to a specific set of subdomains or subdirectories, you can use RegEx to set those rules in the Include or Exclude settings in the Configuration menu.
- In this example, we crawled every page on seerinteractive.com excluding the ‘about’ pages on every subdomain.
Go to Configuration > Exclude; use a wildcard regular expression to identify the URLs or parameters you want to exclude. Test your regular expression to make sure it’s excluding the pages you expected to exclude before you start your crawl. In the example below, we only wanted to crawl the team subfolder on seerinteractive.com.
Again, use the “Test” tab to test a few URLs and ensure the RegEx is appropriately configured for your inclusion rule. This is a great way to crawl larger sites; in fact, Screaming Frog recommends this method if you need to divide and conquer a crawl for a bigger domain.
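If you’d like to sanity-check your patterns outside the tool first, here is a minimal Python sketch of the same idea as the “Test” tab. It is not Screaming Frog’s own logic: the two patterns, the sample URLs, and the `filter_urls` helper are all illustrative assumptions, and the sketch assumes a pattern has to match the whole URL.

```python
import re

# Hypothetical patterns for illustration; adjust them to your own site.
ABOUT_EXCLUDE = r".*/about/.*"
TEAM_INCLUDE = r".*/team/.*"

def filter_urls(urls, include=None, exclude=None):
    """Keep URLs that match the include pattern (if given) and do not match
    the exclude pattern (if given). Patterns must match the whole URL."""
    kept = []
    for url in urls:
        if include and not re.fullmatch(include, url):
            continue
        if exclude and re.fullmatch(exclude, url):
            continue
        kept.append(url)
    return kept

# Made-up sample URLs to test against.
sample_urls = [
    "https://www.seerinteractive.com/about/",
    "https://www.seerinteractive.com/team/jane-doe/",
    "https://www.seerinteractive.com/blog/some-post/",
]

without_about = filter_urls(sample_urls, exclude=ABOUT_EXCLUDE)
team_only = filter_urls(sample_urls, include=TEAM_INCLUDE)
```

Running a handful of known URLs through a check like this before a long crawl is a cheap way to catch a pattern that is too greedy or too narrow.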
Once the crawl is finished, go to the ‘Internal’ tab and filter your results by ‘HTML’. Click ‘Export’, and you’ll have the full list in CSV format.
Running the spider with these settings unchecked will, in effect, give you a list of all of the pages in your starting folder (as long as they are not orphaned pages).
There are several different ways to find all of the subdomains on a site. Use Screaming Frog to identify all subdomains on a given site. Navigate to Configuration > Spider, and ensure that “Crawl all Subdomains” is selected. Just like crawling your whole site above, this will help crawl any subdomain that is linked to within the site crawl. However, this will not find subdomains that are orphaned or unlinked.
Use Google to identify all indexed subdomains. By using the Scraper Chrome extension and some advanced search operators, we can find all indexable subdomains for a given domain.
Start by using a site: search operator in Google to restrict results to your specific domain. Then, use the -inurl search operator to narrow the search results by removing the main domain. You should begin to see a list of subdomains that have been indexed in Google that do not contain the main domain.
Use the Scraper extension to extract all of the results into a Google Sheet. Simply right-click the URL in the SERP, click “Scrape Similar” and export to a Google Doc. In your Google Doc, use the following function to trim the URL to the subdomain:
Essentially, the formula above strips off any subdirectories, pages, or file names at the end of a URL.
It tells Google Sheets or Excel to return everything to the left of the first slash that follows the hostname.
The start number of 9 is significant, because we are asking it to start looking for that slash from the 9th character onward. This skips past the protocol, https://, which is 8 characters long.
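If you’d rather do this step in a script, the same logic can be sketched in Python. This is an illustration of the approach described above, not the original spreadsheet formula; the function name `trim_to_subdomain` and the sample URLs are made up for the example.

```python
def trim_to_subdomain(url, start=9):
    """Return everything to the left of the first slash found at or after
    the `start`-th character (1-indexed, like a spreadsheet FIND)."""
    # Python's str.find is 0-indexed, so the 9th character is index 8.
    slash = url.find("/", start - 1)
    return url if slash == -1 else url[:slash]

# Made-up scraped results for illustration.
scraped = [
    "https://blog.example.com/post/1",
    "https://blog.example.com/post/2",
    "https://shop.example.com/cart/",
    "https://help.example.com",
]

# Trim each URL, then de-duplicate before uploading in List Mode.
subdomains = sorted({trim_to_subdomain(u) for u in scraped})
```

Using a set for the de-duplication covers the clean-up step that follows, so the list is ready to paste into List Mode.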
De-duplicate the list, and upload the list into Screaming Frog in List Mode–you can manually paste the list of domains, use the paste function, or upload a CSV.
Enter the root domain URL into tools that help you look for sites that might exist on the same IP or search engines designed especially to search for subdomains, like FindSubdomains.
Create a free account to login and export a list of subdomains. Then, upload the list to Screaming Frog using List Mode. Once the spider has finished running, you’ll be able to see status codes, as well as any links on the subdomain homepages, anchor text and duplicate page titles among other things.
Screaming Frog was not originally built to crawl hundreds of thousands of pages, but thanks to some upgrades, it’s getting closer every day.
The newest version of Screaming Frog has been updated to rely on database storage for crawls. In version 11.0, Screaming Frog allowed users to opt to save all data to disk in a database rather than just keep it in RAM. This opened up the possibility of crawling very large sites for the first time. In version 12.0, the crawler automatically saves crawls to the database.
This allows them to be accessed and opened using “File > Crawls” in the top-level menu–in case you panic and wonder where the open command went! While using database crawls helps Screaming Frog better manage larger crawls, it’s certainly not the only way to crawl a large site.
First, you can increase the memory allocation of the spider.
Additionally, you can also access queued URLs. This may give you insight about any additional parameters or rules you may want to exclude in order to crawl a large site.
In some cases, older servers may not be able to handle the default number of URL requests per second.
In fact, we recommend including a limit on the number of URLs to crawl per second to be respectful of a site’s server just in case.
It’s best to let a client know when you’re planning on crawling a site just in case they might have protections in place against unknown User Agents.
/descendant::h3[position() >= 0 and position() <= 10]
On one hand, they may need to whitelist your IP or User Agent before you crawl the site.
The worst case scenario may be that you send too many requests to the server and inadvertently crash the site. To change your crawl speed, choose ‘Speed’ in the Configuration menu, and in the pop-up window, select the maximum number of threads that should run concurrently.
From this menu, you can also choose the maximum number of URLs requested per second. If you find that your crawl is resulting in a lot of server errors, go to the ‘Advanced’ tab in the Spider Configuration menu, and increase the value of the ‘Response Timeout’ and of the ‘5xx Response Retries’ to get better results.
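To picture what a “URLs per second” cap means in practice, here is a small Python sketch of a request throttle. It only illustrates the concept and says nothing about how Screaming Frog implements its speed settings; the `Throttle` class and its parameters are hypothetical.

```python
import time

class Throttle:
    """Space out calls so that no more than `max_per_second` happen per second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock    # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last = None      # time of the previous request, if any

    def wait(self):
        """Block until enough time has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now

# Usage: call throttle.wait() before each request.
throttle = Throttle(max_per_second=2)
```

At 2 URLs per second, each request waits until at least half a second has passed since the previous one, which is the kind of breathing room an older server may need.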
Although search bots don’t accept cookies, if you are crawling a site and need to allow cookies, simply select ‘Allow Cookies’ in the ‘Advanced’ tab of the Spider Configuration menu.
To crawl using a different user agent, select ‘User Agent’ in the ‘Configuration’ menu, then select a search bot from the drop-down or type in your desired user agent strings.
As Google is now mobile-first, try crawling the site as Googlebot Smartphone, or modify the User-Agent to be a spoof of Googlebot Smartphone.
This is important for two different reasons:
- Crawling the site mimicking the Googlebot Smartphone user agent may help determine any issues that Google is having when crawling and rendering your site’s content.
- Using a modified version of the Googlebot Smartphone user agent will help you distinguish between your crawls and Google’s crawls when analyzing server logs.
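As a rough sketch of the second point, here is how you might separate your own crawl from Google’s when reading server logs in Python. The marker `SeerAudit` is a hypothetical token you would append to the user agent yourself, and the user agent strings are shortened for illustration.

```python
# Hypothetical marker appended to the spoofed user agent before crawling.
CUSTOM_TOKEN = "SeerAudit"

def classify_user_agent(user_agent):
    """Label a log entry by who most likely made the request."""
    # Check our marker first: the modified user agent still contains "Googlebot".
    if CUSTOM_TOKEN in user_agent:
        return "our crawl"
    if "Googlebot" in user_agent:
        return "googlebot"
    return "other"

# Shortened, illustrative user agent strings.
log_user_agents = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) SeerAudit",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
]

labels = [classify_user_agent(ua) for ua in log_user_agents]
```

The order of the checks matters here: testing for the custom marker first keeps your own requests from being counted as Googlebot hits.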
When the Screaming Frog spider comes across a page that is password-protected, a pop-up box will appear, in which you can enter the required username and password.
Note: Forms-Based authentication should be used sparingly, and only by advanced users.
The crawler is programmed to click every link on a page, so that could potentially result in links to log you out, create posts, or even delete data.
This will provide you with all of the link locations, as well as the corresponding anchor text, directives, etc.
All inlinks can be a big report. Be mindful of this when exporting. For a large site, this export can sometimes take minutes to run.
//link[contains(@media, '640') and @href]/@href
For a quick tally of the number of links on each page, go to the ‘Internal’ tab and sort by ‘Outlinks’.
View Original HTML and Rendered HTML
Anything over 100 might need to be reviewed. Need something a little more processed? Check out this tutorial on calculating the importance of internal linking spearheaded by Allison Hahn and Zaine Clark.
How to identify all of the pages that include meta directives e.g.: nofollow/noindex/noodp/canonical etc.
Once the spider has finished crawling, sort the ‘Internal’ tab results by ‘Status Code’.
Any 404s, 301s, or other status codes will be easily viewable.
Upon clicking on any individual URL in the crawl results, you’ll see information change in the bottom window of the program.
By clicking on the ‘In Links’ tab in the bottom window, you’ll find a list of pages that are linking to the selected URL, as well as anchor text and directives used on those links.
How to crawl using a different user-agent
You can use this feature to identify pages where internal links need to be updated.
How to analyze a list of prospective link locations
To export the full list of pages that include broken or redirected links, choose ‘Redirection (3xx) In Links’ or ‘Client Error (4xx) In Links’ or ‘Server Error (5xx) In Links’ in the ‘Advanced Export’ menu, and you’ll get a CSV export of the data.
How to find broken links for outreach opportunities
To export the full list of pages that include broken or redirected links, visit the Bulk Export menu.
How to find duplicate page titles, meta descriptions, or URLs
How to find broken outbound links on a page or site (or all outbound links in general)
After the spider is finished crawling, click on the ‘External’ tab in the top window, sort by ‘Status Code’ and you’ll easily be able to find URLs with status codes other than 200.
Upon clicking on any individual URL in the crawl results and then clicking on the ‘In Links’ tab in the bottom window, you’ll find a list of pages that are pointing to the selected URL.
//a[contains(translate(., 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz'),'seo spider')]/@href
You can use this feature to identify pages where outbound links need to be updated. To export your full list of outbound links, click ‘External Links’ on the Bulk Export tab.
How to find all of the subdomains on a site and verify internal links.
For a complete listing of all the locations and anchor text of outbound links, select ‘All Outlinks’ in the ‘Bulk Export’ menu.
The All Outlinks report will include outbound links to your subdomains as well; if you want to exclude your domain, lean on the “External Links” report referenced above.
After the spider has finished crawling, select the ‘Response Codes’ tab from the main UI, and filter by Status Code.
Because Screaming Frog uses Regular Expressions for search, submit the following criteria as a filter: 301|302|307.
This should give you a pretty solid list of all links that came back with some sort of redirect, whether the content was permanently moved, found and redirected, or temporarily redirected due to HSTS settings (this is the likely cause of 307 redirects in Screaming Frog).
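The same filter is easy to reproduce on an exported file. The Python sketch below applies the 301|302|307 pattern to a few made-up rows; the URLs and the row layout are assumptions for illustration, not the tool’s export format.

```python
import re

# Same alternation as the filter above: any one of the three redirect codes.
REDIRECT_FILTER = re.compile(r"301|302|307")

# Made-up (URL, status code) rows for illustration.
rows = [
    ("https://example.com/", "200"),
    ("https://example.com/old-page/", "301"),
    ("https://example.com/promo/", "302"),
    ("http://example.com/", "307"),
    ("https://example.com/missing/", "404"),
]

# Keep only the rows whose status code is exactly one of the three codes.
redirects = [(url, code) for url, code in rows if REDIRECT_FILTER.fullmatch(code)]
```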
Sort by ‘Status Code’, and you’ll be able to break the results down by type. Click on the ‘In Links’ tab in the bottom window to view all of the pages where the redirecting link is used.
If you export directly from this tab, you will only see the data that is shown in the top window (original URL, status code, and where it redirects to).
To export the full list of pages that include redirected links, you will have to choose ‘Redirection (3xx) In Links’ in the ‘Advanced Export’ menu.
This will return a CSV that includes the location of all your redirected links.
To show internal redirects only, filter the ‘Destination’ column in the CSV to include only your domain.
Use a VLOOKUP between the 2 export files above to match the Source and Destination columns with the final URL location.
Sample formula: (where ‘response_codes_redirection_(3xx).csv’ is the CSV file that contains the redirect URLs and ‘50’ is the number of rows in that file).
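If spreadsheets aren’t your thing, the lookup can be sketched in Python with a plain dictionary. This is not the VLOOKUP formula itself: the `final_destination` helper and the sample URLs are made up, and unlike a single VLOOKUP it keeps following hops until it reaches a URL that no longer redirects.

```python
def final_destination(url, redirects, max_hops=10):
    """Follow a redirect map from `url` until reaching a URL that does not
    redirect, a loop, or the hop limit."""
    seen = set()
    while url in redirects and url not in seen and len(seen) < max_hops:
        seen.add(url)
        url = redirects[url]
    return url

# Made-up redirect map: redirecting URL -> where it points.
redirects = {
    "https://example.com/old/": "https://example.com/newer/",
    "https://example.com/newer/": "https://example.com/newest/",
}

# Made-up (source page, link destination) pairs from an inlinks export.
inlinks = [
    ("https://example.com/blog/", "https://example.com/old/"),
    ("https://example.com/about/", "https://example.com/newer/"),
]

# For each link, record the page it sits on, where it points, and where it ends up.
resolved = [(src, dest, final_destination(dest, redirects)) for src, dest in inlinks]
```

The `seen` set and the hop limit are there so that a redirect loop can’t keep the script running forever.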
Need to find and fix redirect chains? @dan_shure gives the breakdown on how to do it here.
How to find images that are missing alt text or images that have lengthy alt text
Internal linking opportunities can yield massive ROI–especially when you’re being strategic about the distribution of PageRank & link equity, keyword rankings, and keyword-rich anchors.
Our go-to resource for internal linking opportunities comes down to the impressive Power BI dashboard created by our very own Allison Hahn and Zaine Clark.
Learn more here. After the spider has finished crawling, go to the ‘Internal’ tab, filter by HTML, then scroll to the right to the ‘Word Count’ column.
Sort the ‘Word Count’ column from low to high to find pages with low text content.
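Once the ‘Internal’ tab is exported, the same sort can be scripted. The Python sketch below assumes the export has ‘Address’ and ‘Word Count’ columns; the sample rows and the threshold of 100 words are made up for illustration.

```python
import csv
import io

# Made-up sample standing in for an exported CSV.
SAMPLE_EXPORT = """Address,Word Count
https://example.com/,850
https://example.com/contact/,45
https://example.com/blog/post/,1200
https://example.com/tag/misc/,12
"""

def thin_pages(csv_file, threshold=100):
    """Return (address, word count) pairs below the threshold, lowest first."""
    reader = csv.DictReader(csv_file)
    rows = [(row["Address"], int(row["Word Count"])) for row in reader]
    return sorted((r for r in rows if r[1] < threshold), key=lambda r: r[1])

low_content = thin_pages(io.StringIO(SAMPLE_EXPORT))
```

To run it on a real export, pass an open file handle in place of the `io.StringIO` sample.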