Screaming Frog Crawl

Crawling Websites, Subdomains and Directories

Screaming Frog is a powerful SEO spider capable of in-depth on-site SEO analysis. In this guide we will look at some of its main features and how they help during SEO work.

The free version of Screaming Frog allows you to analyze up to 500 URLs. The tool can crawl an entire website, a single subdomain or a directory; in the paid version you can also enable the “Crawl All Subdomains” option if the site spans more than one subdomain.

If you only need to crawl one subdomain, simply enter its URL in the address box. One of the most common uses is monitoring status codes across a website (4xx, 5xx, 3xx and 200). By default, Screaming Frog crawls a directory simply by entering its address in the bar, as shown in the image below.

For more advanced crawls you can use a wildcard, which tells the SEO Spider to crawl every page that precedes and/or follows the wildcard character. The feature is found under:

Spider > Include, then add the desired syntax in the box that appears. For example, with the pattern https://www.bytekmarketing.com/about/.* the spider only crawls the “About Us” branch of the website, i.e. every resource that follows the wildcard character.

Starting the crawl will extract all the child URLs of the “About Us” section, for example https://www.bytekmarketing.com/about/roberto-paolucci or https://www.bytekmarketing.com/about/mario-rossi.
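Since include rules are regular expressions, here is a minimal sketch of how such a pattern selects URLs, using Python's re module purely for illustration (the URL list is made up from the examples above):

import re

# Include pattern from the example above: crawl only the /about/ branch.
include_pattern = re.compile(r"https://www\.bytekmarketing\.com/about/.*")

urls = [
    "https://www.bytekmarketing.com/about/roberto-paolucci",
    "https://www.bytekmarketing.com/about/mario-rossi",
    "https://www.bytekmarketing.com/blog/some-post",
]

# Keep only URLs matching the include rule, mimicking the spider's filter.
crawlable = [u for u in urls if include_pattern.fullmatch(u)]
print(crawlable)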

This option is particularly useful on large websites, where we may not have the resources to work with very large amounts of data. Keep in mind that in most cases the crawl data will then be processed in Excel, so the starting point should be data that is easy to work with using VLOOKUP, filters and charts.

Crawl Modes

From the “Mode” tab you can select the crawling mode. If you want to crawl a specific set of URLs, the mode to choose is “List”, which lets you import an Excel file with a column containing the list of URLs.

The other way to crawl a list of URLs is copy and paste: copy the list from an external source (Excel, CSV, TXT or an HTML page) and click “Paste”. Each URL must include the http or https protocol and, where used, the www subdomain, so the correct structure of each URL is, for example: http://www.test.it.
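As a convenience, here is a minimal, hypothetical sketch of how a raw list could be normalized before pasting it into List mode, prepending a protocol when it is missing (the default scheme chosen here is an assumption; adjust it to your site):

raw_urls = [
    "www.test.it",
    "http://www.test.it/chi-siamo",
    "test.it/contatti",
]

def normalize(url, default_scheme="https"):
    # Prepend a scheme when it is missing, since List mode
    # requires every URL to include http:// or https://.
    if not url.startswith(("http://", "https://")):
        return f"{default_scheme}://{url}"
    return url

for u in raw_urls:
    print(normalize(u))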

Crawling Large Websites

When you need to analyze a large website and it is not enough to crawl only HTML and images (from an SEO perspective it is often worth checking the status codes of CSS and JS files as well, to make sure search engine spiders can render pages correctly), you can work on the settings:

1. Configuration > System > Memory: allocate more memory, for example 4 GB.
2. Set storage to the database instead of RAM.

If even with these two configurations it is not possible to analyze a large website, the remaining options are:

1. Crawl the website branch by branch, one or more branches at a time, using:
   - the wildcard character;
   - the Include/Exclude options;
   - a custom robots.txt;
   - the crawl depth;
   - query string parameters.
2. Exclude images, CSS, JS and other non-HTML resources from the crawl.

From an SEO perspective, however, a single crawl is ideal because it gives a complete view: for example, the URL From / URL To pairs for 301s and 404s, or the distribution of internal links.

Screaming Frog may also time out or fail to analyze resources (or be very slow) even on small websites; in this case the problem could be related to other factors, such as hosting performance or the fact that our IP address (the one Screaming Frog runs from) has been blocked by the website owner (or by their IT team).

Our IP address can be banned by a provider because Screaming Frog's activity looks very similar to an attack (e.g. a DoS attack) aimed at exhausting server resources and causing 5xx errors.

After the crawl has finished there are multiple export options:

- Save the Screaming Frog crawl (the “source”): having the saved crawl lets you review the data without having to run the crawl again, which is especially useful for large websites or for collaborating with colleagues and sharing the file.
- Save only the tab you need.
- Export all pages to a single Excel file.
- Bulk export, very useful for obtaining, for example, the full internal link distribution:
  - All Inlinks (for internal linking analysis);
  - All Outlinks;
  - All Anchor Text;
  - All Images;
  - Schema.org structured data;
  - …

The image below shows how to export schema.org structured data.
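As an illustration of how a bulk export can be processed once downloaded, here is a minimal sketch using the pandas library and assuming an “All Inlinks” CSV export; the file name and column names (“Source”, “Destination”, “Status Code”) are assumptions, so check the headers in your actual export:

import pandas as pd

# Hypothetical file and column names; adjust to your actual
# "Bulk Export > All Inlinks" export from Screaming Frog.
inlinks = pd.read_csv("all_inlinks.csv")

# Internal links pointing at broken pages (404s).
broken = inlinks[inlinks["Status Code"] == 404][["Source", "Destination"]]

# Internal links pointing at redirects (301s) that should be updated.
redirected = inlinks[inlinks["Status Code"] == 301][["Source", "Destination"]]

# Pages receiving the most internal links (rough internal link distribution).
top_linked = inlinks["Destination"].value_counts().head(20)

print(broken.head())
print(redirected.head())
print(top_linked)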

Saving the Configuration

Screaming Frog allows you to export a configuration file that can be reused for future projects or clients. This is particularly useful if you run SEO analyses for similar clients (with similar website structures) and have set up advanced filters or special extraction options (filters, exclude/include rules or wildcards).

The configuration file is also useful when custom scripts have been written, for example in Python or for the command line, to automate purely mechanical operations. For example, if we need to run a series of purely technical SEO audits whose output requires the same data, it would make no sense to reconfigure Screaming Frog for every website.
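As an illustration of that kind of automation, here is a minimal Python sketch that launches the SEO Spider's command-line interface headlessly with a saved configuration file; the site list and file paths are hypothetical, and the executable name and flag names vary by platform and version, so check the command-line documentation for your installation:

import subprocess

sites = ["https://www.test.it", "https://www.example-client.it"]  # hypothetical

for site in sites:
    # Assumed CLI flags (verify against your Screaming Frog version):
    # --crawl, --headless, --config, --output-folder, --export-tabs
    subprocess.run([
        "screamingfrogseospider",
        "--crawl", site,
        "--headless",
        "--config", "audit.seospiderconfig",   # hypothetical saved configuration
        "--output-folder", "exports",
        "--export-tabs", "Internal:All",
    ], check=True)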

Screaming Frog is robots.txt compliant, so it follows the directives in robots.txt just like Google's crawler does. Through the configuration options it is possible to:

- ignore the robots.txt;
- see the URLs blocked by the robots.txt;
- use a custom robots.txt.

The last option can come in handy before a website goes live, to test the robots.txt file and check that its directives are correct.
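Outside of Screaming Frog, a similar pre-go-live check can be done with Python's standard urllib.robotparser module; a minimal sketch, assuming a draft robots.txt held as a local string and a few hypothetical test URLs:

from urllib import robotparser

# Draft robots.txt content to validate before go-live (illustrative directives).
draft_robots = """
User-agent: *
Disallow: /staging/
Disallow: /cart/
Allow: /
""".strip()

rp = robotparser.RobotFileParser()
rp.parse(draft_robots.splitlines())

# Hypothetical URLs to test against the draft directives.
for url in ["https://www.test.it/", "https://www.test.it/staging/home",
            "https://www.test.it/cart/checkout"]:
    allowed = rp.can_fetch("*", url)
    print(f"{url} -> {'allowed' if allowed else 'blocked'}")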

By default, Screaming Frog does not accept cookies, just like search engine spiders. This option is often underestimated or ignored, but for some websites it is of fundamental importance, because accepting cookies can unlock features and additional code that provide extremely useful SEO and performance information.


For example, accepting cookies can trigger a small piece of JavaScript that adds code to the HTML of the page… and if this code creates SEO problems, how can you verify it? Screaming Frog helps in this case, as shown in the image below.

One of the best ways to create a sitemap is to use an SEO tool like Screaming Frog. WordPress plugins such as Yoast SEO are also fine, but they can suffer from update and compatibility problems; for example, URLs in the sitemap may end up returning status code 404.

It is recommended to generate a sitemap that contains only canonical URLs returning status code 200. For large websites, it is also advisable to create a separate sitemap for each content type (PDF, images and HTML pages) and for each branch of the information architecture.

Having specific sitemaps helps the search engine analyze URLs and file types, and gives you full control, making it easy to compare the URLs in Google's index (via the site: operator) against each individual sitemap.
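As an illustration of that comparison, here is a minimal sketch that parses a sitemap with Python's standard library and diffs it against a list of indexed URLs collected by hand; the file name and the indexed URL list are hypothetical:

import xml.etree.ElementTree as ET

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Parse a local copy of the sitemap (hypothetical file name).
tree = ET.parse("sitemap.xml")
sitemap_urls = {loc.text.strip() for loc in tree.findall(".//sm:loc", NS)}

# URLs observed in Google's index via the site: operator, collected manually.
indexed_urls = {
    "https://www.test.it/",
    "https://www.test.it/chi-siamo",
}

print("In sitemap but not indexed:", sitemap_urls - indexed_urls)
print("Indexed but missing from sitemap:", indexed_urls - sitemap_urls)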

Please note that a single sitemap can contain at most 50,000 URLs. For details on the standard see https://www.sitemaps.org/protocol.html. To generate a sitemap with Screaming Frog, follow the steps below:

Sitemaps (top bar) > XML Sitemap or Images Sitemap. Among the Screaming Frog options you can decide which pages to include based on:

- paginated URLs;
- change frequency;
- noindex images;
- including relevant images based on the number of links they receive;
- including images from a CDN.
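If you ever need to build a sitemap outside of Screaming Frog (for example from a filtered list of canonical, status-200 URLs taken from a crawl export), a minimal sketch with Python's standard library, splitting at the 50,000-URL limit, might look like this; the input URLs are hypothetical:

import xml.etree.ElementTree as ET

# Canonical, status-200 URLs taken from a crawl export (hypothetical values).
urls = [
    "https://www.test.it/",
    "https://www.test.it/chi-siamo",
    "https://www.test.it/contatti",
]

MAX_URLS = 50_000  # sitemaps.org limit per sitemap file

def write_sitemap(chunk, path):
    urlset = ET.Element("urlset",
                        xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for u in chunk:
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = u
    ET.ElementTree(urlset).write(path, encoding="utf-8", xml_declaration=True)

# Split the URL list into as many sitemap files as needed.
for i in range(0, len(urls), MAX_URLS):
    write_sitemap(urls[i:i + MAX_URLS], f"sitemap-{i // MAX_URLS + 1}.xml")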

For large websites, e.g. e-commerce sites, product photos may be hosted on a subdomain or on external hosting, for a variety of reasons such as:

- avoiding the consumption of resources allocated to the CMS;
- ease of management, since scripts can be written for the images alone to improve their performance;
- managing the cron jobs that synchronize the physical warehouse with the e-commerce site.

Internal links

With regard to the structure of the website, and the information architecture in particular, the “Visualisations” section is useful because it provides a graphical view of the site structure as diagrams or graphs.

During an internal linking analysis this section is fundamental, but it is best combined with mind-mapping programs such as XMind and with tools like https://rawgraphs.io/.

Configuration Options

The configuration options of the SEO Spider are collected and organized in tabs; in this section we will look at the macro tabs without going into detail on every individual option.

From the Crawl tab you can enable or disable:

- external links;
- links outside of the start folder;
- follow internal or external nofollow;
- crawl all subdomains;
- crawl outside of the start folder;
- crawl canonicals;
- extraction of hreflang;
- crawl of links inside the sitemap;
- extraction and crawl of AMP links.

The Limits tab is particularly useful for analyzing very large websites, but not only. From this section you can set:

- the total crawl limit, expressed as a number of URLs;
- the crawl depth, expressed as a number of directories;
- the limit on the number of query strings;
- the limit on how many redirects to follow (to avoid 301 chains, which are harmful in terms of resource use and therefore crawl budget; see the sketch after this list);
- the maximum length of URLs to follow, 2,000 characters by default;
- the maximum size of pages to analyze.

Among the advanced options you can also:

- pause on high memory usage;
- always follow redirects;
- always follow canonicals;
- respect noindex;
- respect canonical;
- respect next/prev;
- extract images from the img srcset attribute;
- respect the HSTS policy.
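As mentioned in the list above, 301 chains waste crawl budget. Here is a minimal, illustrative sketch (using the third-party requests library; the example URL is hypothetical) that reports how many hops a URL goes through before reaching its final destination:

import requests

def redirect_chain(url):
    # Follow redirects and return the full chain of URLs visited.
    resp = requests.get(url, allow_redirects=True, timeout=10)
    return [r.url for r in resp.history] + [resp.url]

# Hypothetical URL used only for illustration.
chain = redirect_chain("http://www.test.it/old-page")
print(f"{len(chain) - 1} redirect(s):")
for hop in chain:
    print("  ->", hop)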
