Your friendly neighborhood web scraper

Homepage: http://pyrobot.readthedocs.org/

import re
from pyrobot import RoboBrowser

# Browse to Rap Genius
browser = RoboBrowser(history=True)
browser.open('http://rapgenius.com/')

# Search for Queen
form = browser.get_form(action=re.compile(r'search'))
form['q'].value = 'queen'
browser.submit_form(form)

# Look up the first song
songs = browser.select('.song_name')
browser.follow_link(songs[0])
lyrics = browser.find(class_=re.compile(r'\blyrics\b'))
lyrics.text     # \n[Intro]\nIs this the real life...

# Back to results page
browser.back()

# Look up my favorite song
browser.follow_link('death on two legs')
lyrics = browser.find(class_=re.compile(r'\blyrics\b'))
lyrics.text     # \n[Verse 1]\nYou suck my blood like a leech...

PyRobot combines two excellent Python libraries: Requests and BeautifulSoup. It represents browser sessions using Requests and HTML responses using BeautifulSoup, transparently exposing the methods of both libraries:

import re
from pyrobot import RoboBrowser

browser = RoboBrowser(user_agent='a python robot')
browser.open('https://github.com/')

# Inspect the browser session
browser.session.cookies['_gh_sess']         # BAh7Bzo...
browser.session.headers['User-Agent']       # a python robot

# Search the parsed HTML
browser.select('div.teaser-icon')       # [<div class="teaser-icon">
                                        # <span class="mega-octicon octicon-checklist"></span>
                                        # </div>,
                                        # ...
browser.find(class_=re.compile(r'column', re.I))    # <div class="one-third column">
                                                    # <div class="teaser-icon">
                                                    # <span class="mega-octicon octicon-checklist"></span>
                                                    # ...
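
The calls above reach the session through browser.session, while find and select are called on the browser itself. One way this kind of forwarding could be arranged is with Python's __getattr__ hook. The sketch below is illustrative only and is not taken from PyRobot's source: DelegatingBrowser, StubSession, and StubDocument are hypothetical names, and the stubs stand in for a Requests session and a BeautifulSoup document so the example runs without a network connection or third-party packages.

```python
import re


class StubSession:
    """Stand-in for a Requests session (hypothetical, illustration only)."""

    def __init__(self, user_agent):
        self.headers = {"User-Agent": user_agent}


class StubDocument:
    """Stand-in for a parsed HTML document (hypothetical, illustration only)."""

    def __init__(self, tags):
        # Each tag is a (name, css_class) pair.
        self.tags = tags

    def find(self, class_):
        # Return the first tag whose class matches the compiled pattern.
        for tag in self.tags:
            if class_.search(tag[1]):
                return tag
        return None


class DelegatingBrowser:
    """Hypothetical wrapper; not PyRobot's actual implementation."""

    def __init__(self, session, parsed):
        self.session = session  # reached explicitly, as browser.session
        self.parsed = parsed    # the parsed document for the current page

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails,
        # so unknown names are forwarded to the parsed document.
        if name in ("session", "parsed"):
            # Avoid infinite recursion if these are not set yet.
            raise AttributeError(name)
        return getattr(self.parsed, name)


browser = DelegatingBrowser(
    StubSession("a python robot"),
    StubDocument([("span", "mega-octicon"), ("div", "one-third column")]),
)
print(browser.session.headers["User-Agent"])             # a python robot
print(browser.find(class_=re.compile(r"column", re.I)))  # ('div', 'one-third column')
```

Because __getattr__ runs only after ordinary lookup fails, attributes defined on the wrapper itself always take precedence over same-named attributes on the parsed document.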
