I am a newbie to Python, trying to use Scrapy to scrape a website with multiple pages. Below is a code segment from "spider.py":

import html2text
from scrapy.selector import Selector

def parse(self, response):
    sel = Selector(response)
    tuples = sel.xpath('//*[td[@class = "caption"]]')
    items = []
    for tuple in tuples:
        item = DataTuple()
        keytemp = tuple.xpath('td[1]').extract()[0]
        key = html2text.html2text(keytemp).rstrip()
        valuetemp = tuple.xpath('td[2]').extract()[0]
        value = html2text.html2text(valuetemp).rstrip()
        item[key] = value
        items.append(item)
    return items
Running the code with this command:
scrapy crawl dumbspider -o items.json -t json
it outputs:
{"a":"a-value"}, {"b":"b-value"}, {"c":"c-value"}, {"a":"another-a-value"}, {"b":"another-b-value"}, {"c":"another-c-value"}
but I want something like:
{"a":"a-value", "b":"b-value", "c":"c-value"}, {"a":"another-a-value", "b":"another-b-value", "c":"another-c-value"}
I tried a few ways to tweak spider.py, for example using a temporary list to store the "item"s of a single webpage and then appending the temporary list to "items", but somehow it doesn't work.
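What I'm after is one item per page, something along these lines (a rough sketch of the idea, not my actual code; it assumes DataTuple accepts arbitrary keys like a dict):

import html2text
from scrapy.selector import Selector

def parse(self, response):
    sel = Selector(response)
    rows = sel.xpath('//*[td[@class = "caption"]]')
    item = DataTuple()  # one container for the whole page
    for row in rows:
        key = html2text.html2text(row.xpath('td[1]').extract()[0]).rstrip()
        value = html2text.html2text(row.xpath('td[2]').extract()[0]).rstrip()
        item[key] = value  # collect every key/value pair from this page
    return [item]  # a single item per page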
Update: indentation fixed.
Below I've done a quick mockup of how I'd recommend doing this, as long as you know the number of tds per page. You can take as much or as little of it as you see fit. It's over-engineered for the question (sorry!); you can just take the chunk_by_number bit and be done.
A few things to note:
1) Avoid using 'tuple' as a variable name; it shadows the built-in type.
2) Learn to use generators/built-ins; they'll be faster and lighter if you're hitting a lot of sites at once (see parse_to_kv and chunk_by_number below).
3) Try to isolate the parsing logic, so that if it changes you can swap it out in one place (see extract_td below).
4) Your function does not use 'self', so it should take the @staticmethod decorator and have that parameter removed (a short sketch follows this list).
5) At the moment the output is a dict; import json and dump it if you need a JSON object (see the one-liner after the mockup).
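As a minimal illustration of note 4 (the spider and helper names here are hypothetical, not from the original code):

import scrapy
import html2text

class DumbSpider(scrapy.Spider):
    name = "dumbspider"

    @staticmethod
    def clean_cell(raw_html):
        # No 'self' is used, so the method is static and the parameter is gone.
        return html2text.html2text(raw_html).rstrip()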
import html2text
from scrapy.selector import Selector

def extract_td(item, index):
    # Extraction logic that allows pulling either the key or the
    # value out of a table data cell.
    # Returns a string representation of item[index].
    # Page/tool specific!
    td_as_str = "td[%i]" % index
    val = item.xpath(td_as_str).extract()[0]
    return html2text.html2text(val).rstrip()

def parse_to_kv(xpaths):
    # Yields key, value pairs from the given page.
    # Page specific.
    for xpath in xpaths:
        # XPath positions are 1-based: td[1] is the key, td[2] the value.
        yield extract_td(xpath, 1), extract_td(xpath, 2)

def chunk_by_number(alist, num):
    # Splits alist into chunks of size num.
    # Generic, reusable operation.
    for chunk in zip(*(iter(alist),) * num):
        yield chunk

def parse(response, td_per_page):
    # Extracts key/value pairs based on the table data in the response,
    # then yields tuples of length td_per_page containing these
    # key/value extractions.
    # Specific to our parse patterns.
    sel = Selector(response)
    tuples = sel.xpath('//*[td[@class = "caption"]]')
    kv_generator = parse_to_kv(tuples)
    for page in chunk_by_number(kv_generator, td_per_page):
        print(dict(page))
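To see what chunk_by_number does in isolation, here is a quick illustrative run (not from the original answer) using the exact output you posted:

pairs = [('a', 'a-value'), ('b', 'b-value'), ('c', 'c-value'),
         ('a', 'another-a-value'), ('b', 'another-b-value'), ('c', 'another-c-value')]
for page in chunk_by_number(pairs, 3):
    print(dict(page))
# {'a': 'a-value', 'b': 'b-value', 'c': 'c-value'}
# {'a': 'another-a-value', 'b': 'another-b-value', 'c': 'another-c-value'}

which is exactly the grouping you asked for.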
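And for note 5, turning one of those dicts into a real JSON string only needs the standard library:

import json

json.dumps(dict(page))  # -> '{"a": "a-value", "b": "b-value", "c": "c-value"}'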