Scraping the second page of a website in Python does not work
Let's say I want to scrape the data here.
I can do that nicely using urlopen and BeautifulSoup in Python 2.7.
Now I want to scrape the data from the second page, at this address.
What I get is the data from the first page! I looked at the page source of the second page using Chrome's "View page source", and the content belongs to the first page!
How can I scrape the data from the second page?
The page is quite asynchronous in nature: there are XHR requests forming the search results, so simulate them in your code using requests. Here is sample code as a starting point for you:
    from bs4 import BeautifulSoup
    import requests

    url = 'http://www.amazon.com/best-sellers-books-architecture/zgbs/books/173508/#2'
    ajax_url = "http://www.amazon.com/best-sellers-books-architecture/zgbs/books/173508/ref=zg_bs_173508_pg_2"


    def get_books(data):
        # Parse one HTML fragment and print every book title found in it.
        soup = BeautifulSoup(data, "html.parser")
        for title in soup.select("div.zg_itemImmersion div.zg_title a"):
            print(title.get_text(strip=True))


    with requests.Session() as session:
        # Visit the regular page first so the session picks up its cookies.
        session.get(url)

        # Make the following requests look like the page's own XHR calls.
        session.headers = {
            'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
            'X-Requested-With': 'XMLHttpRequest'
        }

        for page in range(1, 10):
            print("Page #%d" % page)

            params = {
                "_encoding": "UTF8",
                "pg": str(page),
                "ajax": "1"
            }
            # First request: the items shown "above the fold".
            response = session.get(ajax_url, params=params)
            get_books(response.content)

            # Second request: the remaining items of the same page.
            params["isAboveTheFold"] = "0"
            response = session.get(ajax_url, params=params)
            get_books(response.content)
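As for why fetching the `#2` address returns first-page data: the part of a URL after `#` is a fragment, which the client keeps to itself and never sends to the server, so the server sees the same request for every page. The page number has to travel in the query string instead. The sketch below illustrates both points using only the standard library; `request_target` and `page_query` are hypothetical helper names introduced here, and the parameter names are simply the ones used in the code above.

```python
try:
    from urllib.parse import urlsplit, urlencode  # Python 3
except ImportError:
    from urlparse import urlsplit  # Python 2.7
    from urllib import urlencode

url = 'http://www.amazon.com/best-sellers-books-architecture/zgbs/books/173508/#2'


def request_target(url):
    """Return the path and query an HTTP client sends; the fragment is dropped."""
    parts = urlsplit(url)
    target = parts.path or '/'
    if parts.query:
        target += '?' + parts.query
    return target


def page_query(page):
    """Build the query string for one page (hypothetical helper)."""
    return urlencode([("_encoding", "UTF8"), ("pg", str(page)), ("ajax", "1")])


print(urlsplit(url).fragment)  # 2
print(request_target(url))     # /best-sellers-books-architecture/zgbs/books/173508/
print(page_query(2))           # _encoding=UTF8&pg=2&ajax=1
```

The `#2` only tells the browser's JavaScript which page to load via XHR, which is why the requests carrying the `pg` parameter are the ones to reproduce.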
And don't forget to be a good web-scraping citizen and follow the Terms of Use.