Guidelines and Best Practices for Bots, Crawlers, or Scrapers


Overview

Engine Media uses various mechanisms to ensure the stability, security, ownership rights, and cost efficiency of our clients’ websites. These mechanisms can include manual or automated blocking of bots or other crawlers that are detected as not respecting our guidelines. If your business requires crawling or scraping content from one or more of our sites, we ask that you implement a responsible crawler that respects the following guidelines. If you do not follow these guidelines, your crawler may be blocked.

Guidelines

Robots.txt: Respect all rules listed in each site’s robots.txt file. This includes respecting the list of URLs to “Disallow” from crawling, as well as the “Crawl-Delay” directive, which defines how many seconds your crawler should wait between individual requests.

Reference: https://yoast.com/ultimate-guide-robots-txt/#crawl-delay-directive
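
As a reference, the sketch below (Python, standard library only) shows one way to honor both the “Disallow” rules and the “Crawl-Delay” directive. The site URL and user-agent string are placeholders you would replace with your own.

    import time
    import urllib.robotparser

    # Placeholder user-agent; see the User-Agent guideline below.
    USER_AGENT = "ExampleCoBot/1.0 (+https://example.com/bot-info)"

    rp = urllib.robotparser.RobotFileParser()
    rp.set_url("https://www.example-site.com/robots.txt")
    rp.read()

    # Respect Crawl-Delay (seconds between requests); fall back to a conservative default.
    crawl_delay = rp.crawl_delay(USER_AGENT) or 10

    url = "https://www.example-site.com/some/story"
    if rp.can_fetch(USER_AGENT, url):
        # ... perform the request here ...
        time.sleep(crawl_delay)  # wait before issuing the next request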

User-Agent: Use a distinct and identifiable user-agent so that your usage can be identified and not mistakenly flagged as a malicious bot. Through your Customer Success Manager, we can also whitelist your user-agent from certain automated checks/blocks as long as it does not abuse these policies. You should use your own distinct user-agent rather than re-using a normal consumer browser’s user-agent; masquerading as a normal consumer browser can be seen as a malicious action and may be flagged by our systems. We recommend that you include your company name in the user-agent and/or a contact mechanism in case we need to contact you. For example, Google may use a user-agent such as “Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)”.

Reference: https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent
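
For example, a crawler built on the third-party requests library might send its identifiable user-agent on every request as sketched below; the bot name and contact URL are placeholders, not a prescribed format.

    import requests

    # Distinct, identifiable user-agent with a company name and contact URL (placeholders).
    HEADERS = {
        "User-Agent": "Mozilla/5.0 (compatible; ExampleCoBot/1.0; +https://example.com/bot-info)"
    }

    response = requests.get("https://www.example-site.com/some/story",
                            headers=HEADERS, timeout=30)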

Request Throttling: Ensure that you do not overload our systems with excessive requests. As stated above, always follow the “Crawl-Delay” setting within the robots.txt file. If you are crawling multiple Engine Media sites simultaneously, we may require you to implement a “global” crawl delay or throttle across all of your requests on all sites. Additionally, many pieces of content do not update frequently and may be crawled at slower intervals. For example, category listings may update more frequently (5 minutes) than story content (1 hour).
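
One possible approach to a “global” throttle, assuming a multithreaded crawler, is sketched below; the 10-second delay is illustrative only and should come from the relevant robots.txt or your agreement with us.

    import threading
    import time

    class GlobalThrottle:
        """A single rate limiter shared by every request to every site."""

        def __init__(self, delay_seconds):
            self.delay = delay_seconds
            self.lock = threading.Lock()
            self.next_allowed = 0.0

        def wait(self):
            # Reserve the next request slot under the lock, then sleep outside it.
            with self.lock:
                now = time.monotonic()
                wait_for = max(0.0, self.next_allowed - now)
                self.next_allowed = max(now, self.next_allowed) + self.delay
            if wait_for > 0:
                time.sleep(wait_for)

    throttle = GlobalThrottle(delay_seconds=10)
    # Call before every request, regardless of which Engine Media site it targets:
    throttle.wait()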

Caching: Ensure that you cache any crawled content, especially images and other static assets. For example, if you are a Mobile App vendor crawling one of our websites, ensure that you cache and serve the content from your own servers, rather than having each consumer’s Mobile App crawl or scrape content from our sites directly. Where not specified via a standard Cache-Control header, we recommend caching story pages for 1 hour and category or other pages for 5 minutes. Images and other static assets should be cached for at least 24 hours. Video files should be cached for 1 year, as they cannot be modified once uploaded (without creating a new unique URL).
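
The sketch below shows one way to apply these lifetimes with a simple in-memory TTL cache that prefers a server-supplied Cache-Control max-age when present; the content classification ("story", "category", etc.) supplied by the caller is an assumption for illustration, and the response object is expected to come from the requests library.

    import re
    import time

    # Recommended default lifetimes when no Cache-Control max-age is provided.
    DEFAULT_TTLS = {
        "story": 60 * 60,              # 1 hour
        "category": 5 * 60,            # 5 minutes
        "asset": 24 * 60 * 60,         # images / static assets: at least 24 hours
        "video": 365 * 24 * 60 * 60,   # video files: 1 year (content never changes per URL)
    }

    _cache = {}  # url -> (expires_at, body)

    def ttl_for(response, kind):
        # Prefer the server's max-age; fall back to the recommended default above.
        match = re.search(r"max-age=(\d+)", response.headers.get("Cache-Control", ""))
        return int(match.group(1)) if match else DEFAULT_TTLS[kind]

    def cache_put(url, response, kind):
        _cache[url] = (time.time() + ttl_for(response, kind), response.content)

    def cache_get(url):
        entry = _cache.get(url)
        if entry and entry[0] > time.time():
            return entry[1]
        return None  # expired or never cached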

Content Age: If possible, we recommend delaying crawling based on the age of each individual piece of content. Older content is rarely changed, so we recommend slowly backing off the crawl interval for individual pieces of content over time. For example, a piece of content created 6 months ago may never be updated again and can therefore be crawled less frequently. You might implement an incrementing delay of an additional 1 hour for each day since the content was published (or modified), eventually capping at a maximum delay of 24 hours between crawls per individual item.
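
A minimal sketch of that back-off calculation, assuming you track the published or last-modified timestamp for each item:

    from datetime import datetime, timedelta, timezone

    def recrawl_delay(last_modified, now=None):
        """One extra hour of delay per day of age, capped at 24 hours between crawls."""
        now = now or datetime.now(timezone.utc)
        days_old = max(0, (now - last_modified).days)
        return min(timedelta(hours=days_old), timedelta(hours=24))

    # Example: an item last modified 6 days ago gets a 6-hour delay; anything older
    # than 24 days is capped at 24 hours between crawls.
    print(recrawl_delay(datetime.now(timezone.utc) - timedelta(days=6)))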

Peak Hours: We recommend crawling only during off-peak hours where possible, or using a reduced crawl rate during peak hours. For most sites, off-peak hours fall during the overnight period in their local timezone.
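
One way to reduce the crawl rate during peak hours is sketched below (Python 3.9+ for zoneinfo); the peak window and the 3x delay multiplier are illustrative assumptions, not part of these guidelines.

    from datetime import datetime
    from zoneinfo import ZoneInfo

    def effective_delay(base_delay_seconds, site_timezone="America/New_York"):
        """Use a longer delay outside the site's overnight off-peak window."""
        local_hour = datetime.now(ZoneInfo(site_timezone)).hour
        is_off_peak = local_hour < 7 or local_hour >= 23  # assumed off-peak window
        return base_delay_seconds if is_off_peak else base_delay_seconds * 3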

Error Handling: In certain scenarios you may receive an error from our service during your crawls. In the case of a server-side or timeout error (such as a 503 or 408), ensure that you use a retry throttle of at least 1 minute to give our systems time to recover before trying again. If you receive a “403 Forbidden” error, it usually means your crawler has been permanently blocked due to violating one of these rules. You may also receive a 429 error if you are not throttling your requests enough. We highly recommend implementing “serve-stale-while-refresh” logic for your users, so that you can always deliver previously fetched content from your cache until that content can be successfully updated from Engine Media servers.
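
The sketch below combines the retry throttle with a stale-content fallback; it assumes the requests library and a previously cached copy of the content (see the caching guideline above), and the status handling mirrors the guidance in this section.

    import time
    import requests

    RETRY_WAIT_SECONDS = 60  # wait at least 1 minute before retrying a 408/429/503

    def fetch_with_stale_fallback(url, headers, cached_body=None):
        try:
            response = requests.get(url, headers=headers, timeout=30)
        except requests.RequestException:
            return cached_body  # network failure: keep serving the cached copy

        if response.status_code == 200:
            return response.content
        if response.status_code == 403:
            # Likely blocked for violating these guidelines; stop and contact your CSM.
            raise RuntimeError("Crawler appears to be blocked (403); do not keep retrying.")
        if response.status_code in (408, 429, 503):
            # Retry throttle: give the service time to recover, retry once, and serve
            # the stale cached copy if the retry also fails.
            time.sleep(RETRY_WAIT_SECONDS)
            retry = requests.get(url, headers=headers, timeout=30)
            return retry.content if retry.status_code == 200 else cached_body

        return cached_body  # any other error: fall back to the cached copy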

De-duplicate URLs: To avoid excessive requests, you should always de-duplicate URLs while crawling. For example, you may receive the same story URL from multiple category listings. In these cases, you should only crawl the story page once according to the throttling limits above, rather than each time it shows up in a different feed. Additionally, if you are crawling a web page (versus a feed), there will always be a canonical tag in the HTML that you should use to de-duplicate that URL (and you should retain a mapping of that canonical URL to de-duplicate subsequent requests). Finally, Engine Media does not use query strings within its web page URLs, so any query string (such as a UTM code) should be removed before requesting and checked against your de-duplication logic.
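
One way to implement this de-duplication is sketched below, using the third-party beautifulsoup4 package to read the canonical tag; the helper names are illustrative.

    from urllib.parse import urlsplit, urlunsplit
    from bs4 import BeautifulSoup

    seen = set()  # canonical URLs already crawled

    def normalize(url):
        # Drop the query string and fragment before requesting or de-duplicating.
        parts = urlsplit(url)
        return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

    def canonical_url(html, fallback_url):
        # Prefer the canonical tag present in every Engine Media web page.
        tag = BeautifulSoup(html, "html.parser").find("link", rel="canonical")
        return tag["href"] if tag and tag.get("href") else fallback_url

    def should_crawl(url):
        url = normalize(url)
        if url in seen:
            return False  # already crawled via another listing or feed
        seen.add(url)
        return True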

Copyright: Respect the copyright of all content owners as well as the Terms of Service and Privacy Policies of each site. Please ensure you have the proper permission and rights to crawl, scrape, and use content from our sites.

Personal Data: Do not crawl consumers’ personal data. If the site you are crawling contains publicly available consumer data (such as a user’s profile), do not crawl or reuse this data. If you have a valid business need to use personal data, please contact your Customer Success Manager to establish a process for managing this data within our GDPR, CCPA, or other privacy procedures.

For questions or assistance with these policies, or if your crawler has been blocked, please contact your Customer Success Manager.