
JSOUP Finding Groups Of Words

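For a homework assignment I have to write a program that scrapes HTML from a website and then somehow finds phrases within the website. When I say phrases I mean some sort of arbitrary way of organizing text so that words that are in close proximity to each other are put in the same group...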

Solution 1:

What you are looking for is a concept called stemming. From Wikipedia:

A stemmer for English, for example, should identify the string "cats" (and possibly "catlike", "catty" etc.) as based on the root "cat", and "stemmer", "stemming", "stemmed" as based on "stem". A stemming algorithm reduces the words "fishing", "fished", "fish", and "fisher" to the root word, "fish".

You can write a simple brute-force implementation of this yourself. Also check out the stemming algorithm implementations in Lucene and OpenNLP.
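As a rough idea of what a brute-force version might look like, here is a minimal sketch that only strips a few common English suffixes. The class name `NaiveStemmer`, the suffix list, and the three-character minimum are assumptions made for illustration; this is not the algorithm used by Lucene or OpenNLP.

```java
class NaiveStemmer {
    // Suffixes are tried in this order, so "ing" is checked before "s".
    // The list is an illustrative assumption, not a complete rule set.
    private static final String[] SUFFIXES = {"ing", "ed", "er", "es", "s"};

    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : SUFFIXES) {
            int stemLength = w.length() - suffix.length();
            // Keep at least three characters so short words are left alone.
            if (w.endsWith(suffix) && stemLength >= 3) {
                return w.substring(0, stemLength);
            }
        }
        return w;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"fishing", "fished", "fisher", "fish", "cats"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

With these rules "fishing", "fished" and "fisher" all reduce to "fish", and "cats" reduces to "cat". Doubled consonants are not handled, so "stemming" becomes "stemm" rather than "stem"; a real stemmer covers cases like that.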


Solution 2:

Since your question is very unclear, my answer is not perfect in any way. In fact, this is more of a suggestion than an answer, posted here because a comment cannot be this long.

This is an idea based on your definition of phrases: "When I say phrases I mean some sort of arbitrary way of organizing text so that words that are in close proximity to each other are put in the same group".

What I think you are required to do is separate out distinct pieces of text from the HTML as far as possible. There is no hundred-percent reliable way to achieve this, because HTML can be so complex that parsing it this way may become extremely difficult, if not impossible.

Here is one suggestion that came to mind: find continuous pieces of text in the HTML that have no tags in them. This can easily be done with a simple regex. If you are using jsoup, you can do something like this:

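import java.util.regex.Matcher;
import java.util.regex.Pattern;

// doc is an org.jsoup.nodes.Document that has already been parsed
String html = doc.body().toString();
// Capture text between a '>' and the next '<', so tag names are not matched
Matcher m = Pattern.compile(">([^<>]+)<").matcher(html);
while (m.find()) {
    String text = m.group(1).trim();  // one tag-free piece of text (may be empty)
}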

But this alone may not always work, because intermittent HTML decoration such as font changes, or even bold and italic markers, can 'break' these phrases. So you may want to build in some resilience to ignore such things.
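One possible way to get that resilience is to drop inline formatting tags first and only then split on the tags that remain. The sketch below works on a plain HTML string with `java.util.regex` so that it stays self-contained. The class name `InlineTolerantSplitter` and the choice of which tags count as "inline" are assumptions, and regex-based handling like this will not cope with every page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class InlineTolerantSplitter {
    // Tags treated as pure decoration; this list is an assumption.
    // The \b keeps "b" from also matching <br> or <body>.
    private static final Pattern INLINE_TAG = Pattern.compile(
            "</?(?:b|i|u|em|strong|font|span)\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    static List<String> phrases(String html) {
        // Step 1: remove decoration tags so they no longer break a phrase.
        String flattened = INLINE_TAG.matcher(html).replaceAll("");
        // Step 2: every remaining tag acts as a phrase boundary.
        List<String> result = new ArrayList<>();
        for (String piece : flattened.split("<[^>]*>")) {
            String text = piece.trim().replaceAll("\\s+", " ");
            if (!text.isEmpty()) {
                result.add(text);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<p>Hello <b>big</b> world</p><div>Second</div>";
        System.out.println(phrases(html));
    }
}
```

For the sample input in `main`, the bold marker no longer splits the first sentence, so the result has two phrases: "Hello big world" and "Second".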

Or maybe you can find the 'tag distance' of one piece of text from another. That is, count the number of HTML tags that appear between pieces of text, and consider pieces together if they are just one, or maybe 2-3, tags apart.
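The tag-distance idea could be sketched roughly as follows: walk through the HTML as a stream of tags and text runs, count the tags seen since the last piece of text, and start a new group whenever that count exceeds a threshold. The names `TagDistanceGrouper` and `maxTagDistance` are hypothetical, and counting opening and closing tags alike is just one possible definition of distance.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagDistanceGrouper {
    // Group 1 matches one tag, group 2 matches one run of text between tags.
    private static final Pattern TOKEN = Pattern.compile("(<[^>]*>)|([^<]+)");

    static List<String> group(String html, int maxTagDistance) {
        List<String> groups = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int tagsSinceText = 0;
        Matcher m = TOKEN.matcher(html);
        while (m.find()) {
            if (m.group(1) != null) {
                tagsSinceText++;          // opening and closing tags both count
                continue;
            }
            String text = m.group(2).trim();
            if (text.isEmpty()) {
                continue;                 // whitespace between tags is not text
            }
            // Too many tags since the previous text: close the current group.
            if (current.length() > 0 && tagsSinceText > maxTagDistance) {
                groups.add(current.toString());
                current.setLength(0);
            }
            if (current.length() > 0) {
                current.append(' ');
            }
            current.append(text);
            tagsSinceText = 0;
        }
        if (current.length() > 0) {
            groups.add(current.toString());
        }
        return groups;
    }

    public static void main(String[] args) {
        String html = "<div><p>Hello <b>big</b> world</p></div>"
                + "<div><ul><li>Item</li></ul></div>";
        System.out.println(group(html, 2));
    }
}
```

With a threshold of 2, the pieces around the bold tag are one tag apart and end up in the same group, while "Item" is five tags away from "world" and starts a new one. A threshold of 0 puts every piece of text in its own group.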

And finally, you are free to throw some of your own creativity into evolving this approach. Again, I would like to mention that this is just a suggestion for you to build on. All the best.

