Word Count
This page was last reviewed on Dec 5, 2022.
Dot Net Perls
Count words. A String contains text divided into words. With a method, we can count the number of words in the String. This can be implemented in many ways.
With split, we use a regular expression pattern to separate likely words. Then we access the array's length. With a for-loop, we use the Character class to detect likely word separators.
Split implementation. Let us begin with the split() version. We introduce countWords: this method separates a String into an array of strings. We split on non-word chars.
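Detail The regular expression pattern used, "\\W+" in Java source (the regex \W+), indicates one or more non-word characters.
Detail An if-statement is used to detect a zero-word string. This logic works for the cases tested, but may not always be enough: a string that starts with punctuation yields an empty first element, which is counted as a word.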
public class Program {

    public static int countWords(String value) {
        // Split on non-word chars.
        String[] words = value.split("\\W+");
        // Handle an empty string.
        if (words.length == 1 && words[0].length() == 0) {
            return 0;
        }
        // Return array length.
        return words.length;
    }

    public static void main(String[] args) {
        String value = "To be or not to be, that is the question.";
        int count = countWords(value);
        System.out.println(count);

        value = "Stately, plump Buck Mulligan came from the stairhead";
        count = countWords(value);
        System.out.println(count);

        System.out.println(countWords(""));
    }
}
10
8
0
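Matcher sketch. The split() version can miscount when the text begins with a separator, because split() then returns an empty first element. One possible alternative, shown here as a minimal sketch and not as the page's own method, is to count matches of "\\w+" with Pattern and Matcher. The class name MatcherWordCount and the WORD constant are hypothetical names chosen for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class MatcherWordCount {
    // One or more word characters (letters, digits, underscore).
    // WORD is a hypothetical name used only in this sketch.
    private static final Pattern WORD = Pattern.compile("\\w+");

    // Count runs of word characters instead of splitting on separators.
    static int countWords(String value) {
        Matcher m = WORD.matcher(value);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String quoted = "\"Hello,\" she said";
        // split() keeps an empty leading element here, so the length is 4.
        System.out.println(quoted.split("\\W+").length);
        // Counting matches reports the 3 words.
        System.out.println(countWords(quoted));
        System.out.println(countWords("To be or not to be, that is the question."));
        System.out.println(countWords(""));
    }
}
```

This should print 4, 3, 10 and 0. No special case for the empty string is needed, since an empty string simply has no matches.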
Loop version. Let us rewrite our previous countWords method. This version uses a simple loop. We use the Character class to detect certain word boundaries.
Detail This version of countWords does less work. It loops through the characters once, and it avoids the regular expression and the array that split() allocates.
Detail Character.isWhitespace detects whether a char is considered whitespace (this includes spaces, newlines and tabs).
Detail Character.isLetterOrDigit is a convenient method. It returns true if we have a letter (either upper or lowercase) or a digit (like 1, 2 or 3).
Note The countWords method here detects a whitespace character, and if a word-start character follows it, the variable "c" is incremented.
public class Program {

    public static int countWords(String value) {
        int c = 0;
        for (int i = 1; i < value.length(); i++) {
            // See if previous char is a space.
            if (Character.isWhitespace(value.charAt(i - 1))) {
                // See if this char is a word start character.
                // ... Some punctuation chars can start a word.
                char ch = value.charAt(i);
                if (Character.isLetterOrDigit(ch) || ch == '"' || ch == '(') {
                    c++;
                }
            }
        }
        // Count the first word, which has no whitespace before it.
        // ... Short strings like "Hi" are counted too.
        if (value.length() > 0 && !Character.isWhitespace(value.charAt(0))) {
            c++;
        }
        return c;
    }

    public static void main(String[] args) {
        String value = "To be or not to be, that is the question.";
        int count = countWords(value);
        System.out.println(count);

        value = "Stately, plump Buck Mulligan came from the stairhead";
        count = countWords(value);
        System.out.println(count);

        System.out.println(countWords(""));
    }
}
10
8
0
Some issues, for-loop. In the for-loop method (the second example) we have some issues. We check for certain punctuation characters, but more checks may need to be added.
Thus we developed a good approach for a countWords method, but not an ideal implementation.
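Flag sketch. One way to avoid listing punctuation characters one by one is to treat each whitespace-delimited token as a word when it contains at least one letter or digit. This is a different rule from the loop version, offered only as a sketch under that assumption. The class name FlagWordCount and the "counted" flag are hypothetical names for this illustration.

```java
class FlagWordCount {
    // Count whitespace-delimited tokens that contain a letter or digit.
    static int countWords(String value) {
        int count = 0;
        // True once the current token has been counted.
        boolean counted = false;
        for (int i = 0; i < value.length(); i++) {
            char ch = value.charAt(i);
            if (Character.isWhitespace(ch)) {
                // Whitespace ends the current token.
                counted = false;
            } else if (!counted && Character.isLetterOrDigit(ch)) {
                // First letter or digit in this token: count it once.
                count++;
                counted = true;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countWords("To be or not to be, that is the question."));
        System.out.println(countWords("  [draft] - a"));
        System.out.println(countWords("Hi"));
        System.out.println(countWords(""));
    }
}
```

This should print 10, 2, 1 and 0. A lone dash is not counted, and a token like [draft] is counted once no matter which punctuation surrounds it.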
A review. In counting words, we require approximations. Some sequences, like numbers, may or may not be considered words. Hyphenated words too are an issue.
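Rule comparison. To make the approximations concrete, this sketch counts the same sentence under three different definitions of a word. The patterns and the names WordRuleComparison and countMatches are assumptions made for this illustration. None of them is the single correct rule.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class WordRuleComparison {
    // Count non-overlapping matches of a regex in the text.
    static int countMatches(String regex, String value) {
        Matcher m = Pattern.compile(regex).matcher(value);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String text = "The well-known author wrote 42 state-of-the-art books";

        // Rule 1: every run of word chars is a word (hyphens split words).
        System.out.println(countMatches("\\w+", text));

        // Rule 2: hyphenated compounds count as one word.
        System.out.println(countMatches("\\w+(?:-\\w+)*", text));

        // Rule 3: as rule 2, but only letters count, so "42" is skipped.
        System.out.println(countMatches("\\p{L}+(?:-\\p{L}+)*", text));
    }
}
```

This should print 11, 7 and 6. The text is the same each time. Only the definition of a word changes.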
© 2007-2024 Sam Allen.