python - Scraper can't get the correct page with proxy

I'm scraping data from a site. I'm in Russia, and when I use my standard IP and go to the URL, the page is rendered wrong, without data. When I use a British proxy, it's OK.

That's why I have to use a proxy while scraping, but I've run into a strange problem. When I go to http://www.indeed.com/resumes/data-scientist/in-london?co=gb&start=1000 via the browser, it works (the page contains data). When I request it from the script, it's rendered differently.

For some reason the parser doesn't get pages such as http://www.indeed.com/resumes/data-scientist/in-london?co=gb&start=1000 the way I can see them via the browser.

For example, here is the HTML of http://www.indeed.com/resumes/data-scientist/in-london?co=gb&start=950 at the point where the differences begin:

Via the browser (as needed):

<div id="pagination">page:<a class="instl confirm-nav previous" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=gb&amp;start=900">« previous</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=gb&amp;start=850">18</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=gb&amp;start=900">19</a><span class="current_page">20</span><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=gb&amp;start=1000">21</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=gb&amp;start=1050">22</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=gb&amp;start=1100">23</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=gb&amp;start=1150">24</a><a class="instl confirm-nav next" rel="nofollow" href="?q=data+scientist&amp;l=london&amp;co=gb&amp;start=1000">next »</a></div><div id="footer" class=""><p id="footer_nav" class="footer_nav">

The same place as seen by the parser (wrong):

</div><div id="pagination">page:<a class="instl confirm-nav previous" href="?q=data+scientist&amp;l=london&amp;co=gb&amp;start=900" rel="nofollow">< previous</a><a class="instl confirm-nav" href="?q=data+scientist&amp;l=london&amp;co=gb&amp;start=850" rel="nofollow">18</a><a class="instl confirm-nav" href="?q=data+scientist&amp;l=london&amp;co=gb&amp;start=900" rel="nofollow">19</a><span class="current_page">20</span></div><div class="" id="footer"><p class="footer_nav" id="footer_nav">

I'm on Win7 and use Python 3 with BeautifulSoup.

Code:

    from bs4 import BeautifulSoup
    import requests

    # British proxy: the site serves an empty page to my regular IP
    proxy = {"http": "http://134.213.145.228:8080"}
    headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
    page_url = 'http://www.indeed.com/resumes/data-scientist/in-london?co=gb&start=950'

    # Fetch the page through the proxy and force UTF-8 decoding
    req = requests.get(page_url, proxies=proxy, headers=headers)
    req.encoding = 'utf-8'

    # Parse the HTML and collect the profile links
    main = BeautifulSoup(req.text, 'html.parser')
    profile_urls_tag = main.find_all('a', class_="app_link")

Edit 1:

One interesting thing, and I think the problem is related to it: when I use the same proxy in Mozilla I can see 20 pages, but in Chrome I can see 40.

Edit 2: The problem has been solved. It appears you must register and log in to see the full information.
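
Since the symptom was a pagination block that stops at the current page with no "next" link, a scraper could check for that before trusting a page. The following is only a minimal sketch using the standard library's `html.parser`; the names `PaginationProbe` and `has_next_link` are made up for illustration, and it assumes the markup looks like the two snippets above (a `div` with `id="pagination"` whose "next" link carries the class `next`).

```python
from html.parser import HTMLParser


class PaginationProbe(HTMLParser):
    """Collects the class list of every <a> inside <div id="pagination">."""

    def __init__(self):
        super().__init__()
        self._depth = 0  # <div> nesting depth inside the pagination block; 0 means outside
        self.link_classes = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Enter the pagination block, or track a nested <div> inside it
            if self._depth or attrs.get("id") == "pagination":
                self._depth += 1
        elif tag == "a" and self._depth:
            self.link_classes.append((attrs.get("class") or "").split())

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1


def has_next_link(html):
    """Return True if the pagination block contains a link with class 'next'."""
    probe = PaginationProbe()
    probe.feed(html)
    return any("next" in classes for classes in probe.link_classes)


# Shortened versions of the two snippets from the question
full = ('<div id="pagination">page:'
        '<a class="instl confirm-nav previous" href="?start=900">previous</a>'
        '<span class="current_page">20</span>'
        '<a class="instl confirm-nav next" href="?start=1000">next</a></div>')
truncated = ('<div id="pagination">page:'
             '<a class="instl confirm-nav previous" href="?start=900">previous</a>'
             '<span class="current_page">20</span></div>')

print(has_next_link(full))
print(has_next_link(truncated))
```

In the script from the question, `has_next_link(req.text)` returning `False` on a page that is not the last one in the browser would suggest the response is the limited, logged-out view.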

Shah
