This is sort of a follow-up to a question I asked earlier.

I'm trying to scrape a webpage that I have to log in to reach first. After authentication, the webpage requires a little bit of JavaScript to run before you can view the content. I've followed the instructions here to install Splash and try to render the JavaScript. However...

Before I switched to Splash, the authentication with Scrapy's InitSpider was fine. I was getting through the login page and scraping the target page OK (except without the JavaScript working, obviously). But once I add the code to pass the requests through Splash, it looks like I'm not parsing the target page at all.

The spider is below. The difference between the Splash version (here) and the non-Splash version is the function start_requests(). Everything else is the same between the two.
import scrapy
from scrapy.spiders.init import InitSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor


class BboSpider(InitSpider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    start_urls = [
        "http://www.bridgebase.com/myhands/index.php"
    ]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"

    # authentication
    def init_request(self):
        return scrapy.http.Request(url=self.login_page, callback=self.login)

    def login(self, response):
        return scrapy.http.FormRequest.from_response(
            response,
            formdata={'username': 'username', 'password': 'password'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "recent tournaments" in response.body:
            self.log("Login successful")
            return self.initialized()
        else:
            self.log("Login failed")
            print(response.body)

    # pipe the requests through Splash so the JS renders
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    # what to do when a link is encountered
    rules = (
        Rule(LinkExtractor(), callback='parse_item'),
    )

    # do nothing on a new link for now
    def parse_item(self, response):
        pass

    def parse(self, response):
        filename = 'test.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
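For reference, settings.py enables the Splash middleware roughly as the scrapy-splash README describes. The names and order values below are taken from that README (the older scrapyjs package uses different module paths), so treat this as a sketch of the intended setup rather than a verbatim copy of my file:

# Splash-related settings, sketched from the scrapy-splash README
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'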
What's happening is that test.html, the result of parse(), is the login page rather than the page I'm supposed to be redirected to after login.

This is what Scrapy is telling me in the log. Ordinarily I would see the "Login successful" line from check_login_response(), but as you can see below it seems I'm not even getting to that step. Is this because Scrapy is now putting the authentication requests through Splash too, and it's getting hung up there? If that's the case, is there a way to bypass Splash for the authentication part?
2016-01-24 14:54:56 [scrapy] INFO: Spider opened
2016-01-24 14:54:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-24 14:54:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-24 14:55:02 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
2016-01-24 14:55:02 [scrapy] INFO: Closing spider (finished)
I'm pretty sure I'm not using Splash correctly. Can somebody point me to documentation where I can figure out what's going on?
I don't think Splash alone would handle this particular case well.

Here is the working idea:

- use selenium and the PhantomJS headless browser to log in to the website
- pass the browser cookies from PhantomJS to Scrapy

The code:
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class BboSpider(scrapy.Spider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"

    def start_requests(self):
        # log in with a headless browser so the login page's JavaScript runs
        driver = webdriver.PhantomJS()
        driver.get(self.login_page)

        driver.find_element_by_id("username").send_keys("user")
        driver.find_element_by_id("password").send_keys("password")
        driver.find_element_by_name("submit").click()

        driver.save_screenshot("test.png")

        # wait until the post-login page has rendered
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.LINK_TEXT, "Click here for results of recent tournaments")))

        # hand the authenticated session over to Scrapy via the browser cookies
        cookies = driver.get_cookies()
        driver.close()

        yield scrapy.Request("http://www.bridgebase.com/myhands/index.php", cookies=cookies)

    def parse(self, response):
        if "recent tournaments" in response.body:
            self.log("Login successful")
        else:
            self.log("Login failed")

        print(response.body)
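If you want to try it as a one-off script rather than inside a Scrapy project, a minimal driver along these lines should do (CrawlerProcess is part of Scrapy; PhantomJS has to be installed and on your PATH, and the selenium package available):

from scrapy.crawler import CrawlerProcess

# run the spider defined above and block until it finishes
process = CrawlerProcess()
process.crawl(BboSpider)
process.start()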
It prints "Login successful" and the HTML of the "hands" page.