
Python Urllib Robotparser

Captain Salem

In this post, you will learn about the robotparser module in the urllib package, which provides the RobotFileParser class for determining whether a given user agent is allowed to fetch a URL according to the rules published in a site's robots.txt file.

RobotFileParser Class

The RobotFileParser class provides various methods for reading, parsing, and answering questions about the robots.txt file at a given resource.

The supported methods include:

  1. set_url(url) – sets the URL referring to the robots.txt file.
  2. read() – fetches the robots.txt file from that URL and feeds it to the parser.
  3. parse(lines) – parses the given list of lines as a robots.txt file.
  4. can_fetch(useragent, url) – returns True if the specified user agent is allowed to fetch the given URL according to the rules in robots.txt.
  5. mtime() – returns the time the robots.txt file was last fetched.
  6. modified() – sets the last-fetched time for the robots.txt file to the current time.
  7. crawl_delay(useragent) – returns the value of the Crawl-delay parameter for the given user agent, or None if it is not specified.
  8. request_rate(useragent) – returns the Request-rate parameter for the given user agent as a named tuple RequestRate(requests, seconds), or None if it is not specified.
  9. site_maps() – returns the Sitemap entries from the robots.txt file as a list, or None if there are none.
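
Several of these methods can be tried without touching the network by handing parse() an in-memory robots.txt. The rules and URLs below are made up purely for illustration (note that site_maps() requires Python 3.8 or later):

import urllib.robotparser

# A hypothetical robots.txt, supplied as a list of lines
lines = [
    "User-agent: *",
    "Disallow: /private/",
    "Crawl-delay: 5",
    "Sitemap: https://example.com/sitemap.xml",
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(lines)    # parse the in-memory rules instead of fetching with read()
rp.modified()      # record the current time as the last fetch time

print(rp.mtime())                                              # timestamp set by modified()
print(rp.can_fetch("*", "https://example.com/private/page"))   # False
print(rp.can_fetch("*", "https://example.com/public/page"))    # True
print(rp.crawl_delay("*"))                                     # 5
print(rp.site_maps())                                          # ['https://example.com/sitemap.xml']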

Example Use Case

The following code shows how to use the RobotFileParser class and several of its methods against a live robots.txt file.

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://cloudenv.io/robots.txt")
rp.read()

# Get the request rate and handle cases where it's None
req_rate = rp.request_rate("*")
if req_rate is not None:
    req_rate_seconds = req_rate.seconds
else:
    req_rate_seconds = None

# Get the crawl delay
crawl_delay = rp.crawl_delay("*")

# Check if URLs can be fetched
can_fetch_tags = rp.can_fetch("*", "https://cloudenv.io/tags/")
can_fetch_email = rp.can_fetch("*", "https://cloudenv.io/email")

# Print results for debugging
print(f"Request rate seconds: {req_rate_seconds}")
print(f"Crawl delay: {crawl_delay}")
print(f"Can fetch tags: {can_fetch_tags}")
print(f"Can fetch email: {can_fetch_email}")

The code above starts by importing the robotparser module and creating an instance of the RobotFileParser class.

We then point the parser at the site's robots.txt file with set_url() and fetch and parse the file with read(). Finally, we use the remaining methods to query the parsed rules.

The code above should print:

Request rate seconds: None
Crawl delay: None
Can fetch tags: True
Can fetch email: False
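
In a real crawler, these answers are typically combined with a pause between requests. Here is a minimal sketch, assuming the same robots.txt URL as above, a hypothetical list of pages, and a fallback delay of one second when robots.txt specifies none:

import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://cloudenv.io/robots.txt")
rp.read()

# Fall back to a one-second pause when robots.txt specifies no Crawl-delay (assumed default)
delay = rp.crawl_delay("*") or 1

# Hypothetical list of pages to visit
urls = [
    "https://cloudenv.io/tags/",
    "https://cloudenv.io/email",
]

for url in urls:
    if rp.can_fetch("*", url):
        print(f"Fetching {url}")  # a real crawler would request the page here
        time.sleep(delay)         # respect the crawl delay between requests
    else:
        print(f"Skipping {url} (disallowed by robots.txt)")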

Conclusion

In this article, we discussed how to use the robotparser module from the urllib package to read a site's robots.txt file and determine which URLs a crawler is allowed to fetch, along with any crawl delay and request rate the site specifies.
