I am writing a web crawler in Java using jsoup.
I currently have a single-threaded implementation that uses breadth-first search (it only has to crawl one domain, so I could have chosen either DFS or BFS; I opted for BFS since that meant using a queue instead of a stack, and therefore a LinkedBlockingQueue when I make the multithreaded version).
I have a queue of links to visit and a HashSet of already-visited links, and the main loop pops a link off the queue, visits the page, and adds that page's unvisited links to the queue.
These are the contents of the class implementing the single-threaded version (if any of the throws declarations are spurious, please let me know why, as I need to get to grips with that):
private static LinkedBlockingQueue<String> urlsToCrawl = new LinkedBlockingQueue<String>();
private static String baseUrl;
private static String httpsBaseUrl;
private static HashSet<String> alreadyCrawledSet = new HashSet<String>();
private static List<String> deadLinks = new LinkedList<String>();

public static void main(String[] args) throws IOException, InterruptedException {
    // Should output a site map, showing the static assets for each page.
    Validate.isTrue(args.length == 1, "usage: supply url to fetch");

    baseUrl = args[0];
    httpsBaseUrl = baseUrl.replace("http://", "https://");

    alreadyCrawledSet.add(baseUrl);
    urlsToCrawl.add(baseUrl);

    while (!urlsToCrawl.isEmpty()) {
        String url = urlsToCrawl.take();
        crawlUrl(url);
    }
}

private static void crawlUrl(String url) throws IOException, InterruptedException {
    print("%s", url); // print and sameDomain are small helpers not shown here
    try {
        Document doc = Jsoup.connect(url).get();
        Elements links = doc.select("a[href]");
        for (Element link : links) {
            String linkUrl = link.attr("abs:href");
            if (sameDomain(linkUrl) && !alreadyCrawled(linkUrl)) {
                alreadyCrawledSet.add(linkUrl);
                urlsToCrawl.put(linkUrl);
            }
        }
    } catch (HttpStatusException e) {
        deadLinks.add(url); // pages returning an error status are recorded as dead links
    }
}

private static boolean alreadyCrawled(String url) {
    return alreadyCrawledSet.contains(url);
}
I want to make this multithreaded, to take advantage of the fact that the single-threaded implementation has to wait for the HTTP request in the Jsoup.connect(url).get() call to return before it continues processing. I am hoping that by allowing multiple threads to act at once, work will get done during that I/O-bound delay, and therefore speed up the program.
I am not experienced with concurrency. My first thought was to create an Executor and submit every call to crawlUrl to it. But I am confused: I don't know how to make sure the HashSet and the queue are accessed in a thread-safe manner, given that each thread not only consumes URLs from the queue but also pushes new URLs onto it.
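From reading the java.util.concurrent docs, something like the following looks like the thread-safe version of my two collections — a sketch only, and the enqueueIfNew helper name is mine:

import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class CrawlState {

    // LinkedBlockingQueue is already safe for concurrent put/take
    static final BlockingQueue<String> urlsToCrawl = new LinkedBlockingQueue<>();

    // a concurrent set: add() returns false if the element was already present,
    // so the contains-check and the insert happen as one atomic step
    static final Set<String> alreadyCrawledSet = ConcurrentHashMap.newKeySet();

    static void enqueueIfNew(String linkUrl) throws InterruptedException {
        if (alreadyCrawledSet.add(linkUrl)) { // only the first thread to see a URL wins
            urlsToCrawl.put(linkUrl);
        }
    }
}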
I understand the basics of atomicity and the idea that threads can "lock" shared resources, but I don't know how to put them into practice in this scenario.
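As I understand it, putting a lock into practice here would mean wrapping the check-and-add in a synchronized block, so no two threads can interleave between the contains and the add — again a sketch, with markIfNew as an illustrative name:

private static final Object lock = new Object();

// explicit-locking version of the seen-check; the concurrent set in the
// sketch above achieves the same atomicity without a manual lock
private static boolean markIfNew(String url) {
    synchronized (lock) {
        if (alreadyCrawledSet.contains(url)) {
            return false;
        }
        alreadyCrawledSet.add(url);
        return true;
    }
}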
Does anyone have any advice on making this multithreaded?
My solution was to deal with one layer of the graph at a time. For each level, I submit each link to an ExecutorService to be crawled (multithreaded), but then wait for the level to be completed (using a CountDownLatch) before moving on to the next level.
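In outline it looks like the sketch below (reconstructed from memory; fetchLinks, the class name, and the pool size of 8 are illustrative, not my exact code):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LevelCrawler {

    private static final ExecutorService pool = Executors.newFixedThreadPool(8);
    private static final Set<String> alreadyCrawled = ConcurrentHashMap.newKeySet();

    public static void crawl(String baseUrl) throws InterruptedException {
        List<String> currentLevel = new ArrayList<>();
        currentLevel.add(baseUrl);
        alreadyCrawled.add(baseUrl);

        while (!currentLevel.isEmpty()) {
            Queue<String> nextLevel = new ConcurrentLinkedQueue<>();
            CountDownLatch latch = new CountDownLatch(currentLevel.size());

            for (String url : currentLevel) {
                pool.submit(() -> {
                    try {
                        for (String link : fetchLinks(url)) {
                            // add() is atomic: only the first thread to see a URL enqueues it
                            if (alreadyCrawled.add(link)) {
                                nextLevel.add(link);
                            }
                        }
                    } finally {
                        latch.countDown(); // always count down, even if the fetch failed
                    }
                });
            }

            latch.await(); // block until every page in this level has been crawled
            currentLevel = new ArrayList<>(nextLevel);
        }
        pool.shutdown();
    }

    // Fetch a page with jsoup and return its absolute link targets.
    // A real version would also apply the sameDomain filter from the question.
    private static List<String> fetchLinks(String url) {
        List<String> links = new ArrayList<>();
        try {
            Document doc = Jsoup.connect(url).get();
            for (Element a : doc.select("a[href]")) {
                links.add(a.attr("abs:href"));
            }
        } catch (IOException e) {
            // treat the page as a dead link
        }
        return links;
    }
}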
I used a fixed thread pool as a form of rate limiting.
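The pool size is what does the limiting; a cached pool would not bound concurrency at all (the 8 is just a value I picked):

// bounded: at most 8 simultaneous HTTP connections to the target site
ExecutorService pool = Executors.newFixedThreadPool(8);

// unbounded: would spawn a thread (and a connection) per pending URL
// ExecutorService pool = Executors.newCachedThreadPool();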
(Initially I tried to simply dispatch every URL asynchronously, which must be more efficient, but I couldn't figure out how to terminate the whole thing.)
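For the record, one way that termination could work — a sketch, not what I actually used — is to track the number of in-flight tasks with an AtomicInteger and shut the pool down when it reaches zero. Each new URL is counted before its parent task finishes, so the counter can only hit zero once the crawl is truly done:

import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncTermination {

    private static final Set<String> alreadyCrawled = ConcurrentHashMap.newKeySet();
    private static final AtomicInteger pending = new AtomicInteger();

    static void submitCrawl(ExecutorService pool, String url) {
        pending.incrementAndGet(); // count the task before it can possibly finish
        pool.submit(() -> {
            try {
                for (String link : fetchLinks(url)) {
                    if (alreadyCrawled.add(link)) {
                        submitCrawl(pool, link); // children are counted before the parent's decrement
                    }
                }
            } finally {
                // if the last task finishes and spawned no children, the crawl is done
                if (pending.decrementAndGet() == 0) {
                    pool.shutdown();
                }
            }
        });
    }

    // placeholder: use the jsoup helper from the level-by-level sketch above
    private static List<String> fetchLinks(String url) {
        return java.util.Collections.emptyList();
    }
}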