Nginx anti-crawler strategy: blocking crawlers by User-Agent (UA)

Add an anti-crawler policy file:

vim /usr/www/server/nginx/conf/anti_spider.conf

File Contents

# Block crawling by tools such as Scrapy (case-insensitive match)
if ($http_user_agent ~* (Scrapy|Curl|HttpClient)) {
    return 403;
}

# Block the listed UAs and requests with an empty UA (case-sensitive match)
if ($http_user_agent ~ "WinHttp|WebZIP|FetchURL|node-superagent|java/|FeedDemon|Jullo|JikeSpider|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|Java|Feedly|Apache-HttpAsyncClient|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|HttpClient|MJ12bot|heritrix|EasouSpider|Ezooms|BOT/0.1|YandexBot|FlightDeckReports|Linguee Bot|^$" ) {
    return 403;
}

# Block request methods other than GET, HEAD, and POST
if ($request_method !~ ^(GET|HEAD|POST)$) {
    return 403;
}

# Block a single IP:
#deny 123.45.6.7;
# Block the whole range from 123.0.0.1 to 123.255.255.254:
#deny 123.0.0.0/8;
# Block the range from 123.45.0.1 to 123.45.255.254:
#deny 123.45.0.0/16;
# Block the range from 123.45.6.1 to 123.45.6.254:
#deny 123.45.6.0/24;
# The following IPs are all rogue:
#deny 58.95.66.0/24;
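
The deny rules in the file rely on CIDR notation, so it can help to check which addresses a block actually covers before enabling it. The following is a minimal Python sketch using the standard ipaddress module; the helper names describe_block and is_denied are hypothetical, and the sketch only models the range arithmetic, not how nginx evaluates its access rules. Note that a block such as 123.45.6.0/24 also contains the .0 and .255 addresses, so it is slightly wider than the range given in the comments.

```python
import ipaddress


def describe_block(cidr):
    """Return (first address, last address, address count) for a CIDR block."""
    net = ipaddress.ip_network(cidr)
    return str(net[0]), str(net[-1]), net.num_addresses


def is_denied(client_ip, deny_rules):
    """Return True if client_ip falls inside any rule (single IP or CIDR block)."""
    addr = ipaddress.ip_address(client_ip)
    # A bare IP such as "123.45.6.7" is parsed as a /32 network.
    return any(addr in ipaddress.ip_network(rule) for rule in deny_rules)


if __name__ == "__main__":
    # The three block sizes mentioned in the policy file comments.
    for cidr in ("123.0.0.0/8", "123.45.0.0/16", "123.45.6.0/24"):
        first, last, count = describe_block(cidr)
        print(f"{cidr}: {first} to {last} ({count} addresses)")
```

A /8 block covers more than 16 million addresses, so the wider rules are best reserved for cases where the whole range is known to be unwanted.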

Configuration Usage

Include the file in the site's server block:

# Anti-crawler
include /usr/www/server/nginx/conf/anti_spider.conf;

Finally, restart nginx so that the change takes effect.

Verify that it works

Simulate YYSpider (on the blocklist, so a 403 is expected)

λ curl -X GET -I -A 'YYSpider' https://www.myong.top
HTTP/1.1 200 Connection established
HTTP/2 403
server: marco/2.11
date: Fri, 20 Mar 2020 08:48:50 GMT
content-type: text/html
content-length: 146
x-source: C/403
x-request-id: 3ed800d296a12ebcddc4d61c57500aa2

Simulate Baiduspider (not on the blocklist, so the request is allowed)

λ curl -X GET -I -A 'BaiduSpider' https://www.myong.top
HTTP/1.1 200 Connection established
HTTP/2 200
server: marco/2.11
date: Fri, 20 Mar 2020 08:49:47 GMT
content-type: text/html
vary: Accept-Encoding
x-source: C/200
last-modified: Wed, 18 Mar 2020 13:16:50 GMT
etag: "5e721f42-150ce"
x-request-id: e82999a78b7d7ea2e9ff18b6f1f4cc84
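
The two results above follow from how the rules match the User-Agent string. The following is a rough Python sketch of that logic, assuming Python's re module behaves like nginx's PCRE matching for these simple patterns; the blocklist is abbreviated and the function name is_blocked is hypothetical. It illustrates that the first rule is case-insensitive (~*), the second is case-sensitive (~), and that ^$ matches an empty or missing UA.

```python
import re

# Rule 1 uses `~*`, a case-insensitive match.
TOOL_UA = re.compile(r"(Scrapy|Curl|HttpClient)", re.IGNORECASE)

# Rule 2 uses `~`, a case-sensitive match. The list here is abbreviated;
# `^$` matches an empty User-Agent.
LISTED_UA = re.compile(r"YYSpider|AhrefsBot|MJ12bot|Python-urllib|Java|^$")

# Rule 3 rejects every request method except these.
ALLOWED_METHODS = {"GET", "HEAD", "POST"}


def is_blocked(user_agent, method="GET"):
    """Return True if the sketched rules would answer the request with 403."""
    ua = user_agent or ""  # a missing header is treated as an empty string
    if method not in ALLOWED_METHODS:
        return True
    # Unanchored search, like an nginx regex match without ^ or $.
    return bool(TOOL_UA.search(ua) or LISTED_UA.search(ua))


if __name__ == "__main__":
    for ua in ("YYSpider", "BaiduSpider", "", "curl/7.68.0", "yyspider"):
        print(repr(ua), "blocked" if is_blocked(ua) else "allowed")
```

Because the second rule is case-sensitive, a crawler that changes the capitalization of its UA slips past it, and any crawler can send a browser UA, so UA filtering only stops clients that identify themselves honestly.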

Common crawler User-Agents

User-Agent                 Purpose
FeedDemon                  content collection
BOT/0.1 (BOT for JCE)      SQL injection
CrawlDaddy                 SQL injection
Java                       content collection
Jullo                      content collection
Feedly                     content collection
UniversalFeedParser        content collection
ApacheBench                CC attack
Swiftbot                   useless crawler
YandexBot                  useless crawler
AhrefsBot                  useless crawler
YisouSpider                useless crawler (since acquired by UC Shenma Search, so this spider can be allowed)
jikeSpider                 useless crawler
MJ12bot                    useless crawler
ZmEu                       phpMyAdmin vulnerability scanning
WinHttp                    content collection, CC attack
EasouSpider                useless crawler
HttpClient                 TCP attack
Microsoft URL Control      scanning
YYSpider                   useless crawler
jaunty                     WordPress brute-force scanning
oBot                       useless crawler
Python-urllib              content collection
Indy Library               scanning
FlightDeckReports Bot      useless crawler
Linguee Bot                useless crawler
