Multithreaded DFS for web crawler in Java


I am writing a web crawler in Java using jsoup.

I currently have a single-threaded implementation that uses depth-first search. (It only has to crawl one domain, so I could have chosen either DFS or BFS. I opted for the variant that uses a queue instead of a stack - strictly speaking, a FIFO queue makes this a breadth-first traversal - so that I could use a LinkedBlockingQueue in the multithreaded version.)

I have a queue of links to visit and a HashSet of already-visited links. The main loop pops a link off the queue, visits the page, and adds any unvisited links from that page to the queue.

This is the contents of the class implementing the single-threaded version (if any of the throws declarations are spurious, please let me know why, as I need to get to grips with that):

    private static LinkedBlockingQueue<String> urlsToCrawl = new LinkedBlockingQueue<String>();
    private static String baseUrl;
    private static String httpsBaseUrl;
    private static HashSet<String> alreadyCrawledSet = new HashSet<String>();
    private static List<String> deadLinks = new LinkedList<String>();

    public static void main(String[] args) throws IOException, InterruptedException {
        // This should output a site map, showing the static assets for each page.
        Validate.isTrue(args.length == 1, "usage: supply url to fetch");

        baseUrl = args[0];
        httpsBaseUrl = baseUrl.replace("http://", "https://");

        alreadyCrawledSet.add(baseUrl);
        urlsToCrawl.add(baseUrl);

        while (!urlsToCrawl.isEmpty()) {
            String url = urlsToCrawl.take();
            crawlUrl(url);
        }
    }

    private static void crawlUrl(String url) throws IOException, InterruptedException {
        print("%s", url);
        try {
            Document doc = Jsoup.connect(url).get();
            Elements links = doc.select("a[href]");

            for (Element link : links) {
                String linkUrl = link.attr("abs:href");
                if (sameDomain(linkUrl) && !alreadyCrawled(linkUrl)) {
                    alreadyCrawledSet.add(linkUrl);
                    urlsToCrawl.put(linkUrl);
                }
            }
        } catch (HttpStatusException e) {
            deadLinks.add(url);
        }
    }

    private static boolean alreadyCrawled(String url) {
        return alreadyCrawledSet.contains(url);
    }

I would like to make this multithreaded, to take advantage of the fact that the single-threaded implementation has to wait for the HTTP request in the Jsoup.connect(url).get() call to return before it can continue processing. I am hoping that by allowing multiple threads to act at once, work will get done during this I/O-bound delay, and therefore speed up the program.

I am not experienced with concurrency. My first thought was to create an Executor and submit every call to crawlUrl to it. But I am confused - I don't know how to make sure the HashSet and the queue are accessed in a thread-safe manner, given that each thread not only consumes URLs from the queue but also pushes new URLs onto it.
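One possible way to make the shared state safe without explicit locking is to rely on collections that are already thread-safe: LinkedBlockingQueue (which the question already uses) and a concurrent set whose add() is an atomic check-and-insert, so the separate contains()-then-add() pair (which races between threads) disappears. A minimal sketch, assuming a hypothetical enqueueIfNew helper:

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

public class ThreadSafeCrawlerState {
    // LinkedBlockingQueue is already safe for concurrent put()/take().
    static final LinkedBlockingQueue<String> urlsToCrawl = new LinkedBlockingQueue<>();
    // A concurrent set: add() atomically returns false if the element is
    // already present, replacing the racy contains()+add() sequence.
    static final Set<String> alreadyCrawled = ConcurrentHashMap.newKeySet();

    static void enqueueIfNew(String url) throws InterruptedException {
        if (alreadyCrawled.add(url)) {   // atomic check-and-mark
            urlsToCrawl.put(url);
        }
    }

    public static void main(String[] args) throws InterruptedException {
        enqueueIfNew("http://example.com/");
        enqueueIfNew("http://example.com/");  // duplicate: ignored
        System.out.println(urlsToCrawl.size()); // prints 1
    }
}
```

The key point is that the duplicate check and the "mark as seen" step happen in one atomic operation, so two threads discovering the same link cannot both enqueue it.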

I understand the basics of atomicity and the idea that threads can "lock" shared resources, but I don't know how to put them into practice in this scenario.

Does anyone have advice on making this multithreaded?

My solution was to deal with one layer of the graph at a time. For each level, I submit each link to an ExecutorService to be crawled (multithreaded), then wait for the level to be completed (using a CountDownLatch) before moving on to the next level.
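The level-by-level approach described above can be sketched as follows. This is an illustrative reconstruction, not the author's exact code: a hardcoded in-memory link graph stands in for the jsoup fetches, and the names (GRAPH, nextLevel, etc.) are hypothetical.

```java
import java.util.*;
import java.util.concurrent.*;

public class LevelCrawler {
    // Hypothetical in-memory link graph standing in for jsoup page fetches.
    static final Map<String, List<String>> GRAPH = Map.of(
            "/",  List.of("/a", "/b"),
            "/a", List.of("/b", "/c"),
            "/b", List.of("/c"),
            "/c", List.of());

    static final Set<String> visited = ConcurrentHashMap.newKeySet();
    static final Queue<String> nextLevel = new ConcurrentLinkedQueue<>();

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<String> level = List.of("/");
        visited.add("/");

        while (!level.isEmpty()) {
            CountDownLatch latch = new CountDownLatch(level.size());
            for (String url : level) {
                pool.submit(() -> {
                    try {
                        for (String link : GRAPH.getOrDefault(url, List.of())) {
                            if (visited.add(link)) {   // atomic check-and-mark
                                nextLevel.add(link);
                            }
                        }
                    } finally {
                        latch.countDown();  // count down even if the fetch throws
                    }
                });
            }
            latch.await();                  // block until the whole level is done
            level = new ArrayList<>(nextLevel);
            nextLevel.clear();
        }
        pool.shutdown();
        System.out.println(visited.size()); // prints 4
    }
}
```

Counting down in a finally block matters: if a crawl task throws (e.g. an HTTP error), the latch still reaches zero and the main thread is not stuck waiting forever.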

I used a FixedThreadPool as a form of rate limiting.

(Initially I tried to dispatch every URL asynchronously, which surely must be more efficient, but I couldn't figure out how to terminate the whole thing.)
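For the fully asynchronous variant, one common termination pattern (not from the original post; a sketch under the same fake-graph assumption as above) is to keep an AtomicInteger of in-flight tasks: increment before each submit, decrement when a task finishes, and shut the pool down when the count reaches zero. Because a task submits its children before it decrements its own count, the counter can never hit zero while work remains.

```java
import java.util.*;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

public class AsyncCrawler {
    // Hypothetical link graph standing in for real jsoup fetches.
    static final Map<String, List<String>> GRAPH = Map.of(
            "/",  List.of("/a", "/b"),
            "/a", List.of("/c"),
            "/b", List.of("/c"),
            "/c", List.of());

    static final Set<String> visited = ConcurrentHashMap.newKeySet();
    static final ExecutorService pool = Executors.newFixedThreadPool(4);
    // Number of crawl tasks submitted but not yet finished.
    static final AtomicInteger pending = new AtomicInteger();

    static void submit(String url) {
        pending.incrementAndGet();
        pool.submit(() -> {
            try {
                for (String link : GRAPH.getOrDefault(url, List.of())) {
                    if (visited.add(link)) {
                        submit(link);   // child is counted before parent finishes
                    }
                }
            } finally {
                // The last task to finish, with no new work queued, shuts down.
                if (pending.decrementAndGet() == 0) {
                    pool.shutdown();
                }
            }
        });
    }

    public static void main(String[] args) throws InterruptedException {
        visited.add("/");
        submit("/");
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(visited.size()); // prints 4
    }
}
```

This avoids the level barrier entirely, so fast pages do not wait for the slowest page in their level.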

