I'm scraping data from a site. I'm in Russia, and when I use my standard IP and go to the URL, the page is rendered wrong, without data. When I use a British proxy it's OK.
That's why I have to use a proxy while scraping, but I've hit a strange problem. When I go to http://www.indeed.com/resumes/data-scientist/in-london?co=gb&start=1000 via a browser, it works (the page contains data). When I fetch it with my script, the page comes back in a different form.
For some reason the parser doesn't get the pages from http://www.indeed.com/resumes/data-scientist/in-london?co=gb&start=1000 onward, although I can see them via the browser.
For example, here is the HTML of http://www.indeed.com/resumes/data-scientist/in-london?co=gb&start=950 at the point where the differences begin.
Via the browser (as needed):
<div id="pagination">page:<a class="instl confirm-nav previous" rel="nofollow" href="?q=data+scientist&l=london&co=gb&start=900">« previous</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&l=london&co=gb&start=850">18</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&l=london&co=gb&start=900">19</a><span class="current_page">20</span><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&l=london&co=gb&start=1000">21</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&l=london&co=gb&start=1050">22</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&l=london&co=gb&start=1100">23</a><a class="instl confirm-nav" rel="nofollow" href="?q=data+scientist&l=london&co=gb&start=1150">24</a><a class="instl confirm-nav next" rel="nofollow" href="?q=data+scientist&l=london&co=gb&start=1000">next »</a></div><div id="footer" class=""><p id="footer_nav" class="footer_nav">
The same place via the parser (wrong):
</div><div id="pagination">page:<a class="instl confirm-nav previous" href="?q=data+scientist&l=london&co=gb&start=900" rel="nofollow">< previous</a><a class="instl confirm-nav" href="?q=data+scientist&l=london&co=gb&start=850" rel="nofollow">18</a><a class="instl confirm-nav" href="?q=data+scientist&l=london&co=gb&start=900" rel="nofollow">19</a><span class="current_page">20</span></div><div class="" id="footer"><p class="footer_nav" id="footer_nav">
I'm on Windows 7, using Python 3 and BeautifulSoup.
Code:
from bs4 import BeautifulSoup
import requests

# British proxy, needed because the site serves an empty page to my own IP
proxy = {"http": "http://134.213.145.228:8080"}
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
page_url = 'http://www.indeed.com/resumes/data-scientist/in-london?co=gb&start=950'

req = requests.get(page_url, proxies=proxy, headers=headers)
req.encoding = 'utf-8'
main = BeautifulSoup(req.text, 'html.parser')
# Links to the individual resume profiles on this results page
profile_urls_tag = main.find_all('a', class_="app_link")
Edit 1:
One interesting thing, and I think the problem lies in it: when I use the same proxy in Mozilla I can see 20 pages, but in Chrome I can see 40.
Edit 2: The problem has been solved. It appears that you must register and log in to see the full information.
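Since the truncated listing is what a session that isn't logged in receives, a script can at least notice when it has been served the short version by checking whether the pagination block still has a "next" link. The following is only a minimal sketch under stated assumptions: the names `PaginationScanner` and `has_next_link` and the two shortened sample strings are made up for illustration, and it uses the standard-library `html.parser` so it runs without extra packages. With BeautifulSoup the same check would be roughly `main.select_one('#pagination a.next') is not None`. A missing "next" link can also mean the genuine last page, so treat it as a hint rather than proof.

```python
from html.parser import HTMLParser


class PaginationScanner(HTMLParser):
    """Collects the class lists of <a> tags inside <div id="pagination">."""

    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while we are inside the pagination <div>
        self.link_classes = []  # one list of class names per pagination link

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "div":
            # Enter the pagination block, or track nested <div>s inside it
            if self.depth > 0 or attrs.get("id") == "pagination":
                self.depth += 1
        elif tag == "a" and self.depth > 0:
            self.link_classes.append((attrs.get("class") or "").split())

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1


def has_next_link(html):
    """Return True if the pagination block contains a link with class 'next'."""
    scanner = PaginationScanner()
    scanner.feed(html)
    return any("next" in classes for classes in scanner.link_classes)


# Shortened, hypothetical versions of the two responses shown in the question
with_next = ('<div id="pagination">page:'
             '<a class="instl confirm-nav previous" href="?start=900">previous</a>'
             '<span class="current_page">20</span>'
             '<a class="instl confirm-nav next" href="?start=1000">next</a></div>')
without_next = ('<div id="pagination">page:'
                '<a class="instl confirm-nav previous" href="?start=900">previous</a>'
                '<span class="current_page">20</span></div>')

print(has_next_link(with_next), has_next_link(without_next))
```

In the script from the question, `has_next_link(req.text)` returning False before the expected last page would suggest the request was not treated as a logged-in session.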