Friday, 4 April 2014

Scan a set of URLs with Jsoup.

A snippet that reads a list of URLs from a file, scans each page for link tags with Jsoup, and writes the matched links to an output file.


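import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.StringJoiner;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Element;

// in: file with one URL per line; out: file to write; match: Jsoup selector.
public void run(final String in, final String out, final String match) throws IOException {

    // Collects the matched href values, one per line.
    final StringJoiner joiner = new StringJoiner("\n");

    try (BufferedReader br = Files.newBufferedReader(Paths.get(in), StandardCharsets.UTF_8)) {
        for (String line; (line = br.readLine()) != null; ) {
            try {
                System.out.println("Scanning: " + line);
                // timeout(0) disables the timeout, so a slow host can block the scan.
                for (final Element link : Jsoup.connect(line)
                        .timeout(0)
                        .get()
                        .select(match)) {
                    joiner.add(link.attr("href"));
                }
            } catch (Exception e) {
                // Report the failing URL and carry on with the next one.
                System.err.println("Error scanning " + line + ": " + e.getMessage());
            }
        }
    }

    Files.write(Paths.get(out), joiner.toString().getBytes(StandardCharsets.UTF_8));
}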

The 'match' argument is a Jsoup selector such as a[href^=http://], which matches anchor elements whose href attribute starts with http://.
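To see what a selector like this picks out without touching the network, here is a minimal sketch that runs it against an in-memory HTML string. The class name SelectorDemo, the helper hrefs, and the sample markup are made up for illustration and are not part of the snippet above. It assumes Jsoup is on the classpath.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class SelectorDemo {

    // Hypothetical helper: returns the href of every element matched by the selector.
    static List<String> hrefs(final String html, final String match) {
        final Document doc = Jsoup.parse(html);
        final List<String> result = new ArrayList<>();
        for (final Element link : doc.select(match)) {
            result.add(link.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample markup (an assumption): one http link, one relative link, one https link.
        final String html = "<a href=\"http://example.com/a\">absolute</a>"
                + "<a href=\"/relative\">relative</a>"
                + "<a href=\"https://example.com/b\">secure</a>";

        // Only hrefs that start with the literal prefix http:// are selected.
        System.out.println(hrefs(html, "a[href^=http://]"));

        // A looser selector: every anchor that has an href attribute at all.
        System.out.println(hrefs(html, "a[href]"));
    }
}
```

Note that the prefix http:// does not cover https:// links, so a selector such as a[href^=http] may be closer to what you want if both schemes should be collected.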
