I am a newbie to Python, trying to use Scrapy to scrape a website with multiple pages. Below is a code segment from "spider.py":

import html2text
from scrapy.selector import Selector

def parse(self, response):
    sel = Selector(response)
    tuples = sel.xpath('//*[td[@class = "caption"]]')
    items = []
    for tuple in tuples:
        item = DataTuple()
        keytemp = tuple.xpath('td[1]').extract()[0]
        key = html2text.html2text(keytemp).rstrip()
        valuetemp = tuple.xpath('td[2]').extract()[0]
        value = html2text.html2text(valuetemp).rstrip()
        item[key] = value
        items.append(item)
    return items
Running the code with this command:
scrapy crawl dumbspider -o items.json -t json
it outputs:
{"a":"a-value"}, {"b":"b-value"}, {"c":"c-value"}, {"a":"another-a-value"}, {"b":"another-b-value"}, {"c":"another-c-value"}
but I want something like:
{"a":"a-value", "b":"b-value", "c":"c-value"}, {"a":"another-a-value", "b":"another-b-value", "c":"another-c-value"}
I tried a few ways to tweak spider.py, for example using a temporary list to store the "item"s of a single webpage and then appending the temporary list to "items", but somehow it doesn't work.
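What I'm after is one item per page, something along these lines (a rough sketch of the idea, not my actual code; it assumes DataTuple accepts arbitrary keys like a dict):

import html2text
from scrapy.selector import Selector

def parse(self, response):
    sel = Selector(response)
    rows = sel.xpath('//*[td[@class = "caption"]]')
    item = DataTuple()  # one container for the whole page
    for row in rows:
        key = html2text.html2text(row.xpath('td[1]').extract()[0]).rstrip()
        value = html2text.html2text(row.xpath('td[2]').extract()[0]).rstrip()
        item[key] = value  # collect every key/value pair from this page
    return [item]  # a single item per page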
Update: indentation fixed.
Below I've done a quick mockup of how I'd recommend doing this, as long as you know the number of tds per page. You can take as much or as little of it as you see fit. It's over-engineered for the question (sorry!); you can just take the chunk_by_number bit and be done.
A few things to note:
1) Avoid using 'tuple' as a variable name; it shadows the built-in type.
2) Learn to use generators/built-ins; they'll be faster and lighter if you're hitting a lot of sites at once (see parse_to_kv and chunk_by_number below).
3) Try to isolate the parsing logic, so that if it changes you can swap it out in one place (see extract_td below).
4) Your function does not use 'self', so it should take the @staticmethod decorator and have that parameter removed (a short sketch follows this list).
5) At the moment the output is a dict; import json and dump it if you need a JSON object (see the one-liner after the mockup).
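As a minimal illustration of note 4 (the spider and helper names here are hypothetical, not from the original code):

import scrapy
import html2text

class DumbSpider(scrapy.Spider):
    name = "dumbspider"

    @staticmethod
    def clean_cell(raw_html):
        # No 'self' is used, so the method is static and the parameter is gone.
        return html2text.html2text(raw_html).rstrip()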
import html2text
from scrapy.selector import Selector

def extract_td(item, index):
    # Extraction logic that allows pulling either the key or the
    # value out of a table data cell.
    # Returns a string representation of item[index].
    # Page/tool specific!
    td_as_str = "td[%i]" % index
    val = item.xpath(td_as_str).extract()[0]
    return html2text.html2text(val).rstrip()

def parse_to_kv(xpaths):
    # Yields key, value pairs from the given page.
    # Page specific.
    for xpath in xpaths:
        # XPath positions are 1-based: td[1] is the key, td[2] the value.
        yield extract_td(xpath, 1), extract_td(xpath, 2)

def chunk_by_number(alist, num):
    # Splits alist into chunks of size num.
    # Generic, reusable operation.
    for chunk in zip(*(iter(alist),) * num):
        yield chunk

def parse(response, td_per_page):
    # Extracts key/value pairs based on the table data in the response,
    # then yields tuples of length td_per_page containing these
    # key/value extractions.
    # Specific to our parse patterns.
    sel = Selector(response)
    tuples = sel.xpath('//*[td[@class = "caption"]]')
    kv_generator = parse_to_kv(tuples)
    for page in chunk_by_number(kv_generator, td_per_page):
        print(dict(page))
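To see what chunk_by_number does in isolation, here is a quick illustrative run (not from the original answer) using the exact output you posted:

pairs = [('a', 'a-value'), ('b', 'b-value'), ('c', 'c-value'),
         ('a', 'another-a-value'), ('b', 'another-b-value'), ('c', 'another-c-value')]
for page in chunk_by_number(pairs, 3):
    print(dict(page))
# {'a': 'a-value', 'b': 'b-value', 'c': 'c-value'}
# {'a': 'another-a-value', 'b': 'another-b-value', 'c': 'another-c-value'}

which is exactly the grouping you asked for.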
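And for note 5, turning one of those dicts into a real JSON string only needs the standard library:

import json

json.dumps(dict(page))  # -> '{"a": "a-value", "b": "b-value", "c": "c-value"}'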