web crawler - Crawl Coursera webpages using wget with authentication


I am trying to crawl webpages on Coursera that are important for review after a course ends, such as the syllabus, homework, etc.

I am using wget, but found that login is required. I tried the approaches from two other posts (1, 2), but neither of them works.

I also found that Coursera webpages do not end in *.html or *.htm. Is there a way to get past the login and download Coursera webpages using wget?

This Python package, https://github.com/dgorissen/coursera-dl, may be more applicable to what you are asking, except that it doesn't use wget; it uses, and requires, Python instead. The author notes using Python 2.7 and the pip package. The advantage of this package is that it can download everything related to a course in one run.

Do note that you need to accept the honor code for the Coursera class, the first time you open the class page, before the script will run correctly, as noted on the main project page and in the README.md. This project, unlike at least one other on github.com, is actively maintained, with its most recent update within the last 6 months.

I strongly recommend that you check out one of the Python packages because, in my own testing on Windows (unless you find that wget behaves differently on your platform), it appears that the wget tool continues to have an issue with the Coursera secure certificate despite the inclusion of --no-check-certificate in either command.

This testing was done with a wget whose version string reads "GNU Wget 1.14 built on mingw32". Lastly, please note that the same result was encountered with both v1 and v3 of the Coursera login protocol.

wget (using Coursera login v1, see comment below):

    wget --save-cookies=cookies.txt --no-check-certificate --keep-session-cookies --post-data="email=email@example.com&password=mypassword&webrequest=true" https://accounts.coursera.org/api/v1/login?

    Resolving accounts.coursera.org (accounts.coursera.org)... 54.225.163.33, 107.20.232.186, 54.243.110.245
    Connecting to accounts.coursera.org (accounts.coursera.org)|54.225.163.33|:443... connected.
    WARNING: cannot verify accounts.coursera.org's certificate, issued by ...
      Unable to locally verify the issuer's authority.
    HTTP request sent, awaiting response... 400 Bad Request
    ERROR 400: Bad Request.

wget update (using Coursera login v3):

Note that wget (tested on Windows) does not appear to work with either Coursera login v1 (see comment below) or Coursera login v3 (immediately below):

    wget https://accounts.coursera.org/api/login/v3/login? --save-cookies cookies.txt --keep-session-cookies --no-check-certificate --post-data "email=email@example.com&password=mypassword&webrequest=true"

    Resolving accounts.coursera.org (accounts.coursera.org)... 50.19.244.62, 107.20.145.110, 54.221.210.127
    Connecting to accounts.coursera.org (accounts.coursera.org)|50.19.244.62|:443... connected.
    WARNING: cannot verify accounts.coursera.org's certificate, issued by ...
      Unable to locally verify the issuer's authority.
    HTTP request sent, awaiting response... 400 Bad Request
    ERROR 400: Bad Request.
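
If you want to try the same login from Python rather than wget, here is a minimal sketch using only the standard library. It sends the same form fields as the wget commands above to the v1 URL (without the trailing "?") and saves cookies in the cookies.txt format wget uses. The function names build_login_request and login_and_save_cookies are made up for illustration, and I have not confirmed that Coursera accepts this request; given the 400 responses above it may well be rejected the same way.

```python
import http.cookiejar
import urllib.parse
import urllib.request

# Endpoint and form fields are copied from the wget command above; whether
# Coursera still accepts them is an assumption, not something verified here.
LOGIN_URL = "https://accounts.coursera.org/api/v1/login"


def build_login_request(email, password, url=LOGIN_URL):
    """Build the same form-encoded POST that wget sends with --post-data."""
    body = urllib.parse.urlencode(
        {"email": email, "password": password, "webrequest": "true"}
    ).encode("ascii")
    return urllib.request.Request(url, data=body, method="POST")


def login_and_save_cookies(email, password, cookie_file="cookies.txt"):
    """Send the login request and save any cookies in Netscape cookies.txt format."""
    jar = http.cookiejar.MozillaCookieJar(cookie_file)
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    # opener.open raises urllib.error.HTTPError on a 4xx/5xx response.
    with opener.open(build_login_request(email, password)) as response:
        status = response.status
    # ignore_discard=True also writes session cookies, like --keep-session-cookies.
    jar.save(ignore_discard=True)
    return status
```

Calling login_and_save_cookies("email@example.com", "mypassword") would perform the request. Unlike wget with --no-check-certificate, urllib verifies the server certificate by default, so a certificate problem would show up as an error instead of a warning.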
