Scraping the second page of a website in Python does not work
Let's say I want to scrape data from here.
I can do that nicely using urlopen and BeautifulSoup in Python 2.7.
Now I want to scrape data from the second page, at this address.
What I get is the data from the first page! I looked at the page source of the second page using "View page source" in Chrome, and the content belongs to the first page!
How can I scrape data from the second page?
The page is quite asynchronous in nature: the search results are built by XHR requests, so you need to simulate those requests in your code using requests. Sample code as a starting point for you:
    from bs4 import BeautifulSoup
    import requests

    url = 'http://www.amazon.com/best-sellers-books-architecture/zgbs/books/173508/#2'
    ajax_url = "http://www.amazon.com/best-sellers-books-architecture/zgbs/books/173508/ref=zg_bs_173508_pg_2"


    def get_books(data):
        # Print the title of every book found in an HTML chunk.
        soup = BeautifulSoup(data, "html.parser")
        for title in soup.select("div.zg_itemImmersion div.zg_title a"):
            print(title.get_text(strip=True))


    with requests.Session() as session:
        # Visit the regular page first so the session picks up cookies.
        session.get(url)

        # Make the following requests look like the browser's XHR calls.
        session.headers = {
            'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
            'X-Requested-With': 'XMLHttpRequest'
        }

        for page in range(1, 10):
            print("Page #%d" % page)

            # First request: the upper part of the page.
            params = {
                "_encoding": "UTF8",
                "pg": str(page),
                "ajax": "1"
            }
            response = session.get(ajax_url, params=params)
            get_books(response.content)

            # Second request: the rest of the same page.
            params["isAboveTheFold"] = "0"
            response = session.get(ajax_url, params=params)
            get_books(response.content)

And don't forget to be a good web-scraping citizen and follow the Terms of Use.
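
One likely reason a plain urlopen call keeps returning the first page: the part of a URL after `#` (the fragment) is never sent to the server, so the `#2` at the end of the first URL above is only ever seen by the JavaScript running in the browser. Here is a minimal sketch, using only the standard library, that shows what the server actually receives and how the per-page query parameters from the sample code could be put together. The helper names `server_visible_url` and `build_ajax_params` are made up for illustration, and the parameter names are simply copied from the sample code above, not taken from any documented Amazon API.

```python
try:
    from urllib.parse import urldefrag, urlencode  # Python 3
except ImportError:
    from urlparse import urldefrag  # Python 2.7
    from urllib import urlencode

URL = "http://www.amazon.com/best-sellers-books-architecture/zgbs/books/173508/#2"


def server_visible_url(url):
    """Return the URL without its fragment, i.e. what is sent to the server."""
    base, _fragment = urldefrag(url)
    return base


def build_ajax_params(page, above_the_fold=True):
    """Build the query parameters for one half of a result page.

    Hypothetical helper: the keys mirror the ones in the sample code above
    and are an assumption about what the site expects.
    """
    params = {"_encoding": "UTF8", "pg": str(page), "ajax": "1"}
    if not above_the_fold:
        params["isAboveTheFold"] = "0"
    return params


if __name__ == "__main__":
    # The "#2" is dropped, so the server cannot know page 2 was requested.
    print(server_visible_url(URL))

    # The page number has to travel in the query string instead.
    print(urlencode(build_ajax_params(2)))
    print(urlencode(build_ajax_params(2, above_the_fold=False)))
```

In real use you would pass the dict straight to `session.get(ajax_url, params=...)` and let requests do the encoding; `urlencode` is only used here to make the resulting query string visible.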