This is sort of a follow-up to a question I asked earlier.

I'm trying to scrape a webpage that I have to log in to reach first. After authentication, the webpage requires a little bit of JavaScript to run before you can view the content. I've followed the instructions here to install Splash and try to render the JavaScript. However...

Before I switched to Splash, the authentication with Scrapy's InitSpider was fine. I was getting through the login page and scraping the target page OK (except without the JavaScript working, obviously). But once I add the code to pass the requests through Splash, it looks like I'm not parsing the target page at all.

The spider is below. The difference between the Splash version (here) and the non-Splash version is the function start_requests(). Everything else is the same between the two.
import scrapy
from scrapy.spiders.init import InitSpider
from scrapy.spiders import Rule
from scrapy.linkextractors import LinkExtractor


class BboSpider(InitSpider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    start_urls = [
        "http://www.bridgebase.com/myhands/index.php"
    ]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"

    # authentication
    def init_request(self):
        return scrapy.http.Request(url=self.login_page, callback=self.login)

    def login(self, response):
        return scrapy.http.FormRequest.from_response(
            response,
            formdata={'username': 'username', 'password': 'password'},
            callback=self.check_login_response)

    def check_login_response(self, response):
        if "recent tournaments" in response.body:
            self.log("Login successful")
            return self.initialized()
        else:
            self.log("Login failed")
            print(response.body)

    # pipe the requests through Splash so the JS renders
    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url, self.parse, meta={
                'splash': {
                    'endpoint': 'render.html',
                    'args': {'wait': 0.5}
                }
            })

    # what to do when a link is encountered
    rules = (
        Rule(LinkExtractor(), callback='parse_item'),
    )

    # do nothing on a new link for now
    def parse_item(self, response):
        pass

    def parse(self, response):
        filename = 'test.html'
        with open(filename, 'wb') as f:
            f.write(response.body)
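For reference, settings.py enables the Splash middleware roughly as the scrapy-splash README describes. The names and order values below are taken from that README (the older scrapyjs package uses different module paths), so treat this as a sketch of the intended setup rather than a verbatim copy of my file:

# Splash-related settings, sketched from the scrapy-splash README
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'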
What's happening is that test.html, the result of parse(), is the login page rather than the page I'm supposed to be redirected to after login.

This is what Scrapy is telling me in the log. Ordinarily I would see the "Login successful" line from check_login_response(), but as you can see below it seems I'm not even getting to that step. Is this because Scrapy is now putting the authentication requests through Splash too, and it's getting hung up there? If that's the case, is there a way to bypass Splash for the authentication part?
2016-01-24 14:54:56 [scrapy] INFO: Spider opened
2016-01-24 14:54:56 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2016-01-24 14:54:56 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2016-01-24 14:55:02 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html> (referer: None)
2016-01-24 14:55:02 [scrapy] INFO: Closing spider (finished)
I'm pretty sure I'm not using Splash correctly. Can somebody point me to documentation where I can figure out what's going on?
I don't think Splash alone would handle this particular case well.

Here is the working idea:

- use selenium and the PhantomJS headless browser to log in to the website
- pass the browser cookies from PhantomJS to Scrapy

The code:
import scrapy
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC


class BboSpider(scrapy.Spider):
    name = "bbo"
    allowed_domains = ["bridgebase.com"]
    login_page = "http://www.bridgebase.com/myhands/myhands_login.php?t=%2Fmyhands%2Findex.php%3F"

    def start_requests(self):
        # log in with a headless browser so the login page's JavaScript runs
        driver = webdriver.PhantomJS()
        driver.get(self.login_page)

        driver.find_element_by_id("username").send_keys("user")
        driver.find_element_by_id("password").send_keys("password")
        driver.find_element_by_name("submit").click()

        driver.save_screenshot("test.png")

        # wait until the post-login page has rendered
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.LINK_TEXT, "Click here for results of recent tournaments")))

        # hand the authenticated session over to Scrapy via the browser cookies
        cookies = driver.get_cookies()
        driver.close()

        yield scrapy.Request("http://www.bridgebase.com/myhands/index.php", cookies=cookies)

    def parse(self, response):
        if "recent tournaments" in response.body:
            self.log("Login successful")
        else:
            self.log("Login failed")

        print(response.body)
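If you want to try it as a one-off script rather than inside a Scrapy project, a minimal driver along these lines should do (CrawlerProcess is part of Scrapy; PhantomJS has to be installed and on your PATH, and the selenium package available):

from scrapy.crawler import CrawlerProcess

# run the spider defined above and block until it finishes
process = CrawlerProcess()
process.crawl(BboSpider)
process.start()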
It prints "Login successful" and the HTML of the "hands" page.