Scraping the second page of a website in Python does not work


Let's say I want to scrape data from here.

I can do it nicely using urlopen and BeautifulSoup in Python 2.7.

Now suppose I want to scrape data from the second page, at this address.

What I get is the data from the first page! I looked at the page source of the second page using "View page source" in Chrome, and the content belongs to the first page!

How can I scrape the data from the second page?

The page is quite asynchronous in nature: there are XHR requests forming the search results, so you can simulate them in your code using requests. Sample code as a starting point for you:

from bs4 import BeautifulSoup
import requests

url = 'http://www.amazon.com/best-sellers-books-architecture/zgbs/books/173508/#2'
ajax_url = "http://www.amazon.com/best-sellers-books-architecture/zgbs/books/173508/ref=zg_bs_173508_pg_2"


def get_books(data):
    # Parse one AJAX response and print every book title found in it.
    # Class names in the selector are case-sensitive; check them against the live markup.
    soup = BeautifulSoup(data, "html.parser")

    for title in soup.select("div.zg_itemImmersion div.zg_title a"):
        print(title.get_text(strip=True))


with requests.Session() as session:
    # Visit the regular page first so the session picks up any cookies.
    session.get(url)

    # Make the following requests look like the XHR calls the browser sends.
    session.headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; U; Android 4.0.3; ko-kr; LG-L160L Build/IML74K) AppleWebkit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30',
        'X-Requested-With': 'XMLHttpRequest'
    }

    for page in range(1, 10):
        print("Page #%d" % page)

        params = {
            "_encoding": "UTF8",
            "pg": str(page),
            "ajax": "1"
        }
        # First request: the items shown above the fold.
        response = session.get(ajax_url, params=params)
        get_books(response.content)

        # Second request: the remaining items on the same page.
        params["isAboveTheFold"] = "0"
        response = session.get(ajax_url, params=params)
        get_books(response.content)
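As a side note on why the plain urlopen approach keeps returning page one, here is a minimal sketch. It assumes the second-page address differs from the first only by a fragment such as #2, as the url in the sample above does. A fragment is handled by the browser and is not sent to the server, so both addresses lead to the same request. The helper name request_target and the first_page value are made up for illustration.

```python
try:
    from urllib.parse import urlsplit, urlunsplit  # Python 3
except ImportError:
    from urlparse import urlsplit, urlunsplit  # Python 2.7


def request_target(url):
    """Return the URL with its fragment removed, i.e. the part an HTTP client requests."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))


# Assumption: the first page is the same address without the "#2" fragment.
first_page = "http://www.amazon.com/best-sellers-books-architecture/zgbs/books/173508/"
second_page = first_page + "#2"

# Both addresses collapse to the same request, so the server returns the same HTML.
print(request_target(first_page) == request_target(second_page))

# The page number lives only in the fragment, which stays on the client side.
print(urlsplit(second_page).fragment)
```

The page number therefore has to travel in something the server actually sees, which is what the pg query parameter in the AJAX requests does.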

And don't forget to be a good web-scraping citizen and follow the Terms of Use.

