Updated by Richie Lauridsen & Allison Hahn on February 19, 2020. Originally published on May 11, 2015.
So, I admit it: when we started looking at our own blog traffic, we realized this was one of the most historically popular blog posts on the Seer domain. After a brief moment of reflection and a swell of enthusiasm for the ever-present greatness of the Screaming Frog SEO Spider, a tool that’s been a loyal companion in our technical SEO journey, we realized we were doing a disservice, both to our readers and to the many leaps forward Screaming Frog has made.
Though this guide was originally published in 2015, Screaming Frog has since evolved to offer a whole suite of new features and simplified steps to conduct technical audits, check a site’s health, or simply get a quick glimpse of info on a selection of URLs.
Below, you’ll find an updated guide to how SEOs, PPC professionals, and digital marketing experts can use the tool to streamline their workflow. To get started, simply select what it is that you are looking to do.
When starting a crawl, it’s a good idea to take a moment and evaluate what kind of information you’re looking to get, how big the site is, and how much of the site you’ll need to crawl in order to access it all. Sometimes, with larger sites, it’s best to restrict the crawler to a sub-section of URLs to get a good representative sample of data. This keeps file sizes and data exports a bit more manageable. We go over this in further detail below.
For crawling your entire site, including all subdomains, you’ll need to make some slight adjustments to the spider configuration to get started. By default, Screaming Frog only crawls the subdomain that you enter. Any additional subdomains that the spider encounters will be viewed as external links. In order to crawl additional subdomains, you must change the settings in the Spider Configuration menu. By checking ‘Crawl All Subdomains’, you will ensure that the spider crawls any links that it encounters to other subdomains on your site.
In addition, if you’re starting your crawl from a specific subfolder or subdirectory and still want Screaming Frog to crawl the whole site, check the box marked “Crawl Outside of Start Folder.”
By default, the SEO Spider only crawls forward from the subfolder or subdirectory you start in. If you want to crawl the whole site and start from a specific subdirectory, be sure that the configuration is set to crawl outside the start folder.
To save time and disk space, be mindful of resources that you may not need in your crawl. Websites link to so much more than just pages.
If you wish to start your crawl in a specific folder, but want to continue crawling to the rest of the subdomain, be sure to select ‘Crawl Outside Of Start Folder’ in the Spider Configuration settings before entering your specific starting URL.
If you wish to limit your crawl to a specific set of subdomains or subdirectories, you can use RegEx to set those rules in the Include or Exclude settings in the Configuration menu.
In one example, we crawled every page on seerinteractive.com excluding the ‘about’ pages on every subdomain. Go to Configuration > Exclude and use a wildcard regular expression to identify the URLs or parameters you want to exclude. Test your regular expression to make sure it’s excluding the pages you expected to exclude before you start your crawl.
In another example, we only wanted to crawl the team subfolder on seerinteractive.com, so we used the Include settings instead. Again, use the “Test” tab to test a few URLs and ensure the RegEx is appropriately configured for your inclusion rule. This is a great way to crawl larger sites; in fact, Screaming Frog recommends this method if you need to divide and conquer a crawl for a bigger domain.
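As a rough way to sanity-check include and exclude rules outside the tool, the sketch below runs two wildcard patterns against a few sample URLs. The patterns and URLs are hypothetical stand-ins, not the expressions used in the original examples, and Python’s `re` module is not identical to the regex engine inside Screaming Frog, so the “Test” tab remains the final word.

```python
import re

# Hypothetical patterns for illustration only; adapt them to your own site.
# The leading ".*" lets each rule match on any subdomain.
EXCLUDE_PATTERN = r".*seerinteractive\.com/about.*"
INCLUDE_PATTERN = r".*seerinteractive\.com/team/.*"


def is_excluded(url: str) -> bool:
    """Return True if the whole URL matches the exclude pattern."""
    return re.fullmatch(EXCLUDE_PATTERN, url) is not None


def is_included(url: str) -> bool:
    """Return True if the whole URL matches the include pattern."""
    return re.fullmatch(INCLUDE_PATTERN, url) is not None


# Made-up sample URLs to check before starting a crawl.
sample_urls = [
    "https://www.seerinteractive.com/about/",
    "https://go.seerinteractive.com/about/history",
    "https://www.seerinteractive.com/team/",
    "https://www.seerinteractive.com/blog/",
]

for url in sample_urls:
    print(url, "| excluded:", is_excluded(url), "| included:", is_included(url))
```

Checking a handful of URLs you expect to match, and a handful you expect not to, is usually enough to catch an over-broad wildcard.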
Running the spider with these settings unchecked will, in effect, provide you with a list of all of the pages on your site that have internal links pointing to them.
Once the crawl is finished, go to the ‘Internal’ tab and filter your results by ‘HTML’. Click ‘Export’, and you’ll have the full list in CSV format.
There are several different ways to find all of the subdomains on a site.
Use Screaming Frog to identify all subdomains on a given site. Navigate to Configuration > Spider, and ensure that “Crawl all Subdomains” is selected. Just like crawling your whole site above, this will help crawl any subdomain that is linked to within the site crawl. However, this will not find subdomains that are orphaned or unlinked.
Use Google to identify all indexed subdomains. By using the Scraper Chrome extension and some advanced search operators, we can find all indexable subdomains for a given domain. Start by using a site: search operator in Google to restrict results to your specific domain.
Then, use the -inurl search operator to narrow the search results by removing the main domain. You should begin to see a list of subdomains that have been indexed in Google that do not contain the main domain.
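The two search operators can be combined into a single query. The sketch below builds that query string and a search URL for it; the helper name `subdomain_query` is made up for illustration, and it assumes the main site lives on the `www` host.

```python
from urllib.parse import quote_plus


def subdomain_query(domain: str, known_subdomains=("www",)) -> str:
    """Build a Google query restricted to `domain`, minus hosts already known."""
    parts = [f"site:{domain}"]
    # Each -inurl term removes results whose URL contains that host label.
    parts += [f"-inurl:{sub}" for sub in known_subdomains]
    return " ".join(parts)


query = subdomain_query("seerinteractive.com")
print(query)
print("https://www.google.com/search?q=" + quote_plus(query))
```

As new subdomains turn up in the results, they can be added to `known_subdomains` to narrow the next search further.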
Use the Scraper extension to extract all of the results into a Google Sheet. Simply right-click the URL in the SERP, click “Scrape Similar” and export to a Google Doc.
In your Google Doc, use a formula to trim each URL to the subdomain. Essentially, the formula should strip off any subdirectories, pages, or file names at the end of a URL. It tells Sheets or Excel to return everything to the left of the first slash that follows the protocol. The start number of 9 is significant, because we are asking it to start looking for a slash from the 9th character onward. This accounts for the protocol: https://, which is 8 characters long.
De-duplicate the list, and upload the list into Screaming Frog in List Mode. You can manually paste the list of domains, use the paste function, or upload a CSV.
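For anyone who prefers to do the trimming and de-duplication outside a spreadsheet, here is a minimal Python sketch of the same logic. It is an equivalent of the approach described above, not the spreadsheet formula itself; the function names and sample URLs are made up for illustration.

```python
def trim_to_subdomain(url: str) -> str:
    """Keep everything up to and including the first slash after the protocol."""
    # Index 8 is the 9th character, which skips the two slashes in "https://".
    slash = url.find("/", 8)
    # If there is no slash after the host, the URL is already trimmed.
    return url if slash == -1 else url[: slash + 1]


def unique_subdomains(urls):
    """Trim every URL and drop duplicates while keeping first-seen order."""
    return list(dict.fromkeys(trim_to_subdomain(u) for u in urls))


# Made-up scraped results for illustration.
scraped = [
    "https://go.seerinteractive.com/guides/page.html",
    "https://go.seerinteractive.com/other/",
    "https://labs.seerinteractive.com/tool?id=1",
]

for host in unique_subdomains(scraped):
    print(host)
```

The printed list can be pasted straight into List Mode.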
Enter the root domain URL into tools that help you look for sites that might exist on the same IP or search engines designed especially to search for subdomains, like FindSubdomains.
Create a free account to log in and export a list of subdomains. Then, upload the list to Screaming Frog using List Mode. Once the spider has finished running, you’ll be able to see status codes, as well as any links on the subdomain homepages, anchor text, and duplicate page titles, among other things.
Screaming Frog was not originally built to crawl hundreds of thousands of pages, but thanks to some upgrades, it’s getting closer every day. The newest version of Screaming Frog has been updated to rely on database storage for crawls. In version 11.0, Screaming Frog allowed users to opt to save all data to disk in a database rather than just keep it in RAM.
This opened up the possibility of crawling very large sites for the first time. In version 12.0, the crawler automatically saves crawls to the database.
By deselecting these options in the Configuration menu, you can save memory by crawling HTML only. Until recently, the Screaming Frog SEO Spider might have paused or crashed when crawling a large site.
Now, with database storage as the default setting, you can recover crawls to pick up where you left off. You can also access queued URLs, which may give you insight into any additional parameters or rules you may want to exclude in order to crawl a large site.
In some cases, older servers may not be able to handle the default number of URL requests per second. In fact, we recommend including a limit on the number of URLs to crawl per second to be respectful of a site’s server just in case.
It’s best to let a client know when you’re planning on crawling a site, just in case they have protections in place against unknown User Agents. They may need to whitelist your IP or User Agent before you crawl the site. In the worst-case scenario, you may send too many requests to the server and inadvertently crash the site.
To change your crawl speed, choose ‘Speed’ in the Configuration menu, and in the pop-up window, select the maximum number of threads that should run concurrently.
From this menu, you can also choose the maximum number of URLs requested per second. If you find that your crawl is resulting in a lot of server errors, go to the ‘Advanced’ tab in the Spider Configuration menu, and increase the value of the ‘Response Timeout’ and of the ‘5xx Response Retries’ to get better results.
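When choosing a URLs-per-second limit, it can help to estimate how long the crawl will take at that rate. The sketch below is plain arithmetic with hypothetical numbers; it gives a best-case figure only, since it ignores response times, retries, and rendering.

```python
def estimated_crawl_hours(url_count: int, max_urls_per_second: float) -> float:
    """Best-case crawl duration in hours at a fixed request rate."""
    if max_urls_per_second <= 0:
        raise ValueError("max_urls_per_second must be positive")
    return url_count / max_urls_per_second / 3600


# Hypothetical site size and candidate speed limits.
site_urls = 500_000
for rate in (1, 2, 5):
    hours = estimated_crawl_hours(site_urls, rate)
    print(f"{rate} URL/s -> about {hours:.1f} hours")
```

A slower limit is kinder to the server, so the estimate is mostly useful for planning when to start the crawl rather than for justifying a faster one.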
Although search bots don’t accept cookies, if you are crawling a site and need to allow cookies, simply select ‘Allow Cookies’ in the ‘Advanced’ tab of the Spider Configuration menu.
To crawl using a different user agent, select ‘User Agent’ in the ‘Configuration’ menu, then select a search bot from the drop-down or type in your desired user agent string.
As Google is now mobile-first, try crawling the site as Googlebot Smartphone, or modify the User-Agent to be a spoof of Googlebot Smartphone. This is important for two different reasons:
- Crawling the site mimicking the Googlebot Smartphone user agent may help determine any issues that Google is having when crawling and rendering your site’s content.
- Using a modified version of the Googlebot Smartphone user agent will help you distinguish between your crawls and Google’s crawls when analyzing server logs.
When the Screaming Frog spider comes across a page that is password-protected, a pop-up box will appear, in which you can enter the required username and password.
To manage authentication, navigate to Configuration > Authentication. In order to turn off authentication requests, deselect ‘Standards Based Authentication’ in the ‘Authentication’ window from the Configuration menu.
Once the spider has finished crawling, use the Bulk Export menu to export a CSV of ‘All Links’. This will provide you with all of the link locations, as well as the corresponding anchor text, directives, etc. Be mindful that the all-inlinks report can be large; for a large site, this export can sometimes take minutes to run.
For a quick tally of the number of links on each page, go to the ‘Internal’ tab and sort by ‘Outlinks’.
Any 404s, 301s, or other status codes will be easily viewable. Upon clicking on any individual URL in the crawl results, you’ll see information change in the bottom window of the program. By clicking on the ‘In Links’ tab in the bottom window, you’ll find a list of pages that are linking to the selected URL, as well as anchor text and directives used on those links.
You can use this feature to identify pages where internal links need to be updated. To export the full list of pages that include broken or redirected links, visit the Bulk Export menu and choose ‘Redirection (3xx) In Links’, ‘Client Error (4xx) In Links’, or ‘Server Error (5xx) In Links’, and you’ll get a CSV export of the data.
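Once you have an inlinks export, a short script can group the problem links by the page that contains them, which makes for an easier fix list. The sketch below is a minimal example under stated assumptions: the column names (`Source`, `Destination`, `Anchor`, `Status Code`) and the sample rows are illustrative, so check them against the header row of your own export.

```python
import csv
import io
from collections import defaultdict

# Made-up sample data; the column names are assumptions about the export.
SAMPLE_EXPORT = """Source,Destination,Anchor,Status Code
https://www.example.com/,https://www.example.com/old-page,Old page,404
https://www.example.com/,https://www.example.com/moved,Moved,301
https://www.example.com/blog/,https://www.example.com/old-page,Read more,404
https://www.example.com/blog/,https://www.example.com/contact,Contact,200
"""


def problem_links_by_source(csv_text: str) -> dict:
    """Map each source page to its links that returned a 3xx, 4xx, or 5xx."""
    pages = defaultdict(list)
    for row in csv.DictReader(io.StringIO(csv_text)):
        status = int(row["Status Code"])
        if status >= 300:
            pages[row["Source"]].append((row["Destination"], status))
    return dict(pages)


for source, links in problem_links_by_source(SAMPLE_EXPORT).items():
    print(source)
    for destination, status in links:
        print(f"  {status}  {destination}")
```

For a real export, replace the sample string with the contents of the CSV file read from disk.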