Java Word Count Methods: Split and For Loop

This Java article uses split and a for-loop to count words in a String. It checks word boundaries with the Character class.
Count words. A String contains text divided into words. With a method, we can count the number of words in the String. This can be implemented in many ways.
With split, we use a regular expression pattern to separate likely words. Then we access the array's length. With a for-loop, we use the Character class to detect likely word separators.
Split implementation. Let us begin with the split() version. We introduce countWords: this method separates a String into an array of strings. We split on non-word chars.

Pattern: The regular expression pattern used, "\\W+" in Java source, indicates one or more non-word characters.

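If: An if-statement is used to detect a zero-word string. This logic works for the cases tested, but it is not always enough: a string that begins with a separator, such as a leading space, yields an empty first element that is still counted.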

Java program that implements countWords with split

    public class Program {

        public static int countWords(String value) {
            // Split on non-word chars.
            String[] words = value.split("\\W+");
            // Handle an empty string.
            if (words.length == 1 && words[0].length() == 0) {
                return 0;
            }
            // Return array length.
            return words.length;
        }

        public static void main(String[] args) {
            String value = "To be or not to be, that is the question.";
            int count = countWords(value);
            System.out.println(count);

            value = "Stately, plump Buck Mulligan came from the stairhead";
            count = countWords(value);
            System.out.println(count);

            System.out.println(countWords(""));
        }
    }

Output

    10
    8
    0

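
Leading separators. The if-statement above only covers the empty string. As a minimal sketch (the class SplitEdgeCase and the method countWordsSkippingEmpty are hypothetical names, not part of the original program), the code below shows how a separator at the start of the String makes split() return an empty first element, and one possible way to skip it.

```java
class SplitEdgeCase {

    // Hypothetical variant of countWords: counts only the non-empty pieces.
    static int countWordsSkippingEmpty(String value) {
        int count = 0;
        for (String piece : value.split("\\W+")) {
            if (!piece.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String value = " (To be) or not";
        // A separator at the start produces a leading empty string,
        // so the raw array length is one more than the word count.
        System.out.println(value.split("\\W+").length);      // 5
        System.out.println(countWordsSkippingEmpty(value));  // 4
        System.out.println(countWordsSkippingEmpty(""));     // 0
    }
}
```

Skipping empty pieces also removes the need for a separate empty-string check, at the cost of a loop over the array.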
Loop version. Let us rewrite our previous countWords method. This version uses a simple loop. We use the Character class to detect certain word boundaries.

Complexity: This version of countWords does less work. It loops through all characters once, and it does not run a regular expression or allocate an array of strings.

IsWhitespace: This method detects whether a char is considered whitespace (this includes spaces, newlines and tabs).

IsLetterOrDigit: This is a convenient method. It returns true if we have a letter (either upper or lowercase) or a digit (like 1, 2 or 3).

Note: The countWords method here detects a whitespace character, and if a word-start character follows it, the variable c is incremented. A final check adds one for the first word, which has no whitespace before it.
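
Character checks. To make the two Character methods concrete, here is a small sketch. The class CharacterChecks and the helper isWordStart are hypothetical names introduced for illustration. The helper simply groups the word-start test described in the notes above (a letter, a digit, a double quote or an opening parenthesis).

```java
class CharacterChecks {

    // Hypothetical helper: true if the char may begin a word,
    // using the same three conditions as the loop version.
    static boolean isWordStart(char ch) {
        return Character.isLetterOrDigit(ch) || ch == '"' || ch == '(';
    }

    public static void main(String[] args) {
        // Spaces, newlines and tabs are all whitespace.
        System.out.println(Character.isWhitespace(' '));     // true
        System.out.println(Character.isWhitespace('\n'));    // true
        System.out.println(Character.isWhitespace('\t'));    // true
        System.out.println(Character.isWhitespace('a'));     // false

        // Letters of either case and digits pass; punctuation does not.
        System.out.println(Character.isLetterOrDigit('Z'));  // true
        System.out.println(Character.isLetterOrDigit('7'));  // true
        System.out.println(Character.isLetterOrDigit(','));  // false

        // The helper also accepts the two punctuation chars.
        System.out.println(isWordStart('('));                // true
        System.out.println(isWordStart(','));                // false
    }
}
```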

Java program that implements countWords with loop

    public class Program {

        public static int countWords(String value) {
            int c = 0;
            for (int i = 1; i < value.length(); i++) {
                // See if previous char is a space.
                if (Character.isWhitespace(value.charAt(i - 1))) {
                    // See if this char is a word start character.
                    // ... Some punctuation chars can start a word.
                    if (Character.isLetterOrDigit(value.charAt(i))
                            || value.charAt(i) == '"'
                            || value.charAt(i) == '(') {
                        c++;
                    }
                }
            }
            // Count the first word, which has no whitespace before it.
            // ... A length test alone would miss short strings such as "Hi".
            if (value.length() > 0 && !Character.isWhitespace(value.charAt(0))) {
                c++;
            }
            return c;
        }

        public static void main(String[] args) {
            String value = "To be or not to be, that is the question.";
            int count = countWords(value);
            System.out.println(count);

            value = "Stately, plump Buck Mulligan came from the stairhead";
            count = countWords(value);
            System.out.println(count);

            System.out.println(countWords(""));
        }
    }

Output

    10
    8
    0

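Some issues, for-loop. In the for-loop method (the second example) we have some issues. We check for certain punctuation characters, but more checks may need to be added. For example, a word that follows a space and begins with a single quote or a square bracket is not counted.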

Thus: We developed a good approach for a countWords method, but not an ideal implementation.
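
Alternative loop. One possible way to avoid listing punctuation characters is to track state instead. This is only a sketch under a stated assumption: a "word" is taken to be any whitespace-delimited token that contains at least one letter or digit. The names WordCountSketch and countWordsByState are hypothetical, and this is not the method used in the programs above.

```java
class WordCountSketch {

    // Hypothetical alternative: counts each whitespace-delimited token
    // once, at the first letter or digit found inside it.
    static int countWordsByState(String value) {
        int count = 0;
        boolean tokenCounted = false;
        for (int i = 0; i < value.length(); i++) {
            char ch = value.charAt(i);
            if (Character.isWhitespace(ch)) {
                // Whitespace ends the current token.
                tokenCounted = false;
            } else if (!tokenCounted && Character.isLetterOrDigit(ch)) {
                // First letter or digit of this token.
                tokenCounted = true;
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWordsByState("To be or not to be, that is the question.")); // 10
        System.out.println(countWordsByState("'quoted' [text]"));                           // 2
        System.out.println(countWordsByState("Hi"));                                        // 1
        System.out.println(countWordsByState(" - "));                                       // 0
        System.out.println(countWordsByState(""));                                          // 0
    }
}
```

With this definition a token made only of punctuation is ignored, and a hyphenated word counts once. Whether that is desirable depends on the text being measured.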

A review. In counting words, we require approximations. Some sequences, like numbers, may or may not be considered words. Hyphenated words too are an issue.
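
Approximation example. The choice of separator decides how numbers and hyphenated words are counted. The sketch below compares two patterns on the same input. The class ApproximationDemo and the helper countPieces are hypothetical names, and the whitespace pattern "\\s+" is an alternative chosen for illustration, not one the programs above use.

```java
class ApproximationDemo {

    // Hypothetical helper: counts the non-empty pieces produced
    // by splitting on the given separator pattern.
    static int countPieces(String value, String separatorRegex) {
        int count = 0;
        for (String piece : value.split(separatorRegex)) {
            if (!piece.isEmpty()) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        String hyphenated = "a well-known fact";
        String numeric = "pi is 3.14";

        // Non-word separators break on the hyphen and the decimal point.
        System.out.println(countPieces(hyphenated, "\\W+")); // 4
        System.out.println(countPieces(numeric, "\\W+"));    // 4

        // Whitespace separators keep each token whole.
        System.out.println(countPieces(hyphenated, "\\s+")); // 3
        System.out.println(countPieces(numeric, "\\s+"));    // 3
    }
}
```

Neither count is wrong. Each reflects a different definition of a word.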
© 2007-2019 Sam Allen. Every person is special and unique. Send bug reports to info@dotnetperls.com.
Dot Net Perls