Web crawler: crawling Coursera webpages using wget with authentication
I am trying to crawl some Coursera webpages that are important for review after a course ends, such as the syllabus, homework, etc.
I am using wget, but I found that a login is required. I tried the approaches from two other posts (1, 2), but neither of them worked.
I also found that Coursera webpages do not end in *.html or *.htm. Is there a way to get past the login and download Coursera webpages using wget?
This Python package, https://github.com/dgorissen/coursera-dl, may be more applicable to what you are asking, except that it doesn't use wget; it uses, and requires, Python instead. The author notes using Python 2.7 and the pip package. An advantage of this package is that it can download everything related to a course in one run.
Do note that you need to accept the honor code for the Coursera class, the first time you open the class page, before the script will run correctly, as noted on the main project page and in the README.md. This project, unlike at least one other on github.com, is actively maintained, with the most recent update within the last 6 months.
I strongly recommend that you check out one of the Python packages because, in my own testing on Windows (unless you find a difference with wget on your platform), it appears that the wget tool continues to have an issue with the Coursera secure certificate despite the inclusion of --no-check-certificate in either command.
This testing was done with GNU Wget 1.14 built on mingw32, according to its version string. Lastly, please note that the same result was encountered with both v1 and v3 of the Coursera login protocol.
wget (using Coursera login v1, per the comment below):
wget --save-cookies=cookies.txt --no-check-certificate --keep-session-cookies --post-data="email=email@example.com&password=mypassword&webrequest=true" https://accounts.coursera.org/api/v1/login?
Resolving accounts.coursera.org (accounts.coursera.org)... 54.225.163.33, 107.20.232.186, 54.243.110.245
Connecting to accounts.coursera.org (accounts.coursera.org)|54.225.163.33|:443... connected.
WARNING: cannot verify accounts.coursera.org's certificate, issued ... Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 400 Bad Request
ERROR 400: Bad Request.
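For comparison, the same form submission can be sketched in Python. This is only a minimal sketch and not a fix for the 400 response: it builds the POST request that the wget command above describes, without sending it. The helper name build_login_request is hypothetical, and the field names and URL are copied from the command above. One detail it makes visible is that form encoding percent-encodes the "@" in the email address, whereas wget's --post-data sends the string exactly as typed.

```python
from urllib.parse import urlencode
from urllib.request import Request

# URL taken from the wget command in the post, minus the empty "?" query string.
LOGIN_URL = "https://accounts.coursera.org/api/v1/login"


def build_login_request(email, password, url=LOGIN_URL):
    """Build (but do not send) a form-encoded login POST.

    Hypothetical helper: the field names mirror the --post-data string
    used with wget; whether the server accepts them is not verified here.
    """
    fields = {"email": email, "password": password, "webrequest": "true"}
    # urlencode percent-encodes reserved characters such as "@".
    body = urlencode(fields).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(url, data=body, headers=headers, method="POST")
```

Printing build_login_request("email@example.com", "mypassword").data shows the exact bytes that would be posted, which can be compared against the --post-data string.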
wget update (using Coursera login v3):
Note that wget (tested on Windows) does not appear to work with either Coursera login v1 (per the comment below) or Coursera login v3 (immediately below):
wget https://accounts.coursera.org/api/login/v3/login? --save-cookies cookies.txt --keep-session-cookies --no-check-certificate --post-data "email=email@example.com&password=mypassword&webrequest=true"
Resolving accounts.coursera.org (accounts.coursera.org)... 50.19.244.62, 107.20.145.110, 54.221.210.127
Connecting to accounts.coursera.org (accounts.coursera.org)|50.19.244.62|:443... connected.
WARNING: cannot verify accounts.coursera.org's certificate, issued ... Unable to locally verify the issuer's authority.
HTTP request sent, awaiting response... 400 Bad Request
ERROR 400: Bad Request.
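Since the Python route is the one recommended here, the cookie handling that --save-cookies and --keep-session-cookies provide in the commands above can be sketched with the standard library. This is a rough sketch under assumptions, not how coursera-dl itself works: the helper names make_cookie_session and save_session are hypothetical, and nothing here performs the login or addresses the 400 response. It only shows how a session could keep cookies in the same Netscape cookies.txt format that wget writes.

```python
import http.cookiejar
import urllib.request


def make_cookie_session(cookie_path):
    """Return (opener, jar); the opener stores any cookies it receives in jar.

    MozillaCookieJar reads and writes the Netscape cookies.txt format,
    the same file format wget uses for --save-cookies.
    """
    jar = http.cookiejar.MozillaCookieJar(cookie_path)
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar)
    )
    return opener, jar


def save_session(jar):
    """Write the jar to disk, including session cookies.

    ignore_discard=True keeps cookies that would normally be dropped at
    the end of the session, comparable to wget's --keep-session-cookies.
    """
    jar.save(ignore_discard=True, ignore_expires=True)
```

A caller would create the session with make_cookie_session("cookies.txt"), send requests through opener.open(...), and call save_session(jar) afterwards so that later runs can reuse the cookies.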