In a String, some characters, such as spaces, newlines, and tabs, are considered whitespace. These special chars can cause problems.
For example, multiple whitespace chars together may need to be combined (condensed). And Windows and UNIX newlines may need to be normalized (converted to one form).
Remove, condense whitespace
This program changes whitespace in Strings. It can remove all whitespace chars. And it can collapse whitespace.
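The removeAllWhitespace method calls the replaceAll method with the pattern "\\s". It removes all whitespace chars. The collapseWhitespace method uses the pattern "\\s+" instead, so each run of whitespace in the string is replaced with a single space.

    public class Program {

        static String removeAllWhitespace(String value) {
            // Remove all whitespace characters.
            return value.replaceAll("\\s", "");
        }

        static String collapseWhitespace(String value) {
            // Replace all whitespace blocks with single spaces.
            return value.replaceAll("\\s+", " ");
        }

        public static void main(String[] args) {
            String value = " Hi,\r\n\t\tA B C";

            // Test our methods.
            String result = removeAllWhitespace(value);
            System.out.println(result);

            result = collapseWhitespace(value);
            System.out.println(result);
        }
    }

The program prints the following. The second line keeps one leading space, because leading whitespace is collapsed, not removed.

    Hi,ABC
     Hi, A B C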
toCharArray
This example uses another approach to whitespace. It converts a String to a char array and changes the array's elements.

    public class Program {

        static String convertWhitespaceToSpaces(String value) {
            // Convert String to a character array.
            char[] array = value.toCharArray();
            for (int i = 0; i < array.length; i++) {
                // Modify all newlines and tabs to be spaces.
                switch (array[i]) {
                case '\r':
                case '\n':
                case '\t':
                    array[i] = ' ';
                    break;
                }
            }
            // Return the modified string.
            return new String(array);
        }

        public static void main(String[] args) {
            String value = "A B\nC D\tE F";
            // Test the conversion method.
            System.out.println(convertWhitespaceToSpaces(value));
        }
    }

The program prints:

    A B C D E F
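A variation on the loop above, as a minimal sketch: instead of listing \r, \n, and \t in a switch, it asks Character.isWhitespace about each char, so other whitespace chars (such as the form feed) are converted too. The class and method names here are hypothetical and not part of the original program.

```java
class WhitespaceArrayDemo {

    // Hypothetical helper: convert every whitespace char to a plain space.
    static String convertAllWhitespaceToSpaces(String value) {
        char[] array = value.toCharArray();
        for (int i = 0; i < array.length; i++) {
            // True for space, tab, newline, carriage return, form feed and others.
            if (Character.isWhitespace(array[i])) {
                array[i] = ' ';
            }
        }
        return new String(array);
    }

    public static void main(String[] args) {
        String value = "A B\nC D\tE\fF";
        System.out.println(convertAllWhitespaceToSpaces(value)); // A B C D E F
    }
}
```

Because each char is replaced one for one, the String length does not change. A \r\n pair becomes two spaces.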
UNIX uses just one character, \n, for newlines. But Windows uses two: the \r\n sequence. We can convert Windows newlines to UNIX ones.
The String here contains two Windows newlines. After the conversion, its length is reduced by 2 chars.

    public class Program {

        static String convertToUNIXNewlines(String value) {
            // Normalize the newlines in the String.
            return value.replace("\r\n", "\n");
        }

        public static void main(String[] args) {
            // This string contains both Windows and UNIX newlines.
            String value = "A B\r\nC\r\nD\nE";

            // Replace Windows newlines.
            String result = convertToUNIXNewlines(value);

            // Write length before and after.
            System.out.println(value.length());
            System.out.println(result.length());
        }
    }

The program prints:

    11
    9
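The replace call above only handles the \r\n sequence. As a hedged sketch, a regular expression can also normalize a lone \r. This version assumes Java 8 or later, where the \R linebreak matcher is available. Note that \R also matches a few other line-break chars, such as the form feed and the Unicode line separator, which may or may not be wanted. The class and method names are hypothetical.

```java
class LinebreakDemo {

    // Hypothetical helper: normalize every kind of line break to \n.
    static String normalizeLineBreaks(String value) {
        // \R matches \r\n as one unit, and also a lone \r or \n.
        return value.replaceAll("\\R", "\n");
    }

    public static void main(String[] args) {
        // Contains a Windows newline, a lone carriage return and a UNIX newline.
        String value = "A\r\nB\rC\nD";
        String result = normalizeLineBreaks(value);

        System.out.println(value.length());  // 8
        System.out.println(result.length()); // 7
    }
}
```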
This method handles the reverse conversion: it converts from UNIX to Windows newlines. This is more complex, because replacing every \n directly would turn an existing \r\n into \r\r\n.
The result String has the correct number of newline chars: each of the two UNIX newlines gains a \r, and the existing Windows newline is not doubled.

    public class Program {

        static String convertToUNIXNewlines(String value) {
            return value.replace("\r\n", "\n");
        }

        static String convertToWindowsNewlines(String value) {
            // Convert to UNIX lines to normalize all newlines.
            // ... Then replace with Windows newlines.
            value = convertToUNIXNewlines(value);
            return value.replace("\n", "\r\n");
        }

        public static void main(String[] args) {
            // This string contains 2 UNIX newlines and 1 Windows newline.
            String value = "Cat\nDog\nFish\r\nBird";
            String result = convertToWindowsNewlines(value);

            // Write lengths.
            // ... The two UNIX newlines were converted.
            // ... The Windows newline is unchanged in the result.
            System.out.println(value.length());
            System.out.println(result.length());
        }
    }

The program prints:

    18
    20
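The two-step conversion can also be written as a single pass with a regular expression. This is only a sketch of an alternative, not the method used above: the pattern "\\r?\\n" matches a newline with an optional \r before it, so both kinds of newline are replaced with \r\n in one call. The class and method names are hypothetical.

```java
class NewlineRegexDemo {

    // Hypothetical helper: convert UNIX and Windows newlines to Windows newlines.
    static String toWindowsNewlines(String value) {
        // An existing \r\n is matched as a whole, so it is never doubled.
        return value.replaceAll("\\r?\\n", "\r\n");
    }

    public static void main(String[] args) {
        String value = "Cat\nDog\nFish\r\nBird";
        String result = toWindowsNewlines(value);

        System.out.println(value.length());  // 18
        System.out.println(result.length()); // 20
    }
}
```

Calling the method a second time on its own result leaves the String unchanged, which is a quick way to check that no \r\r\n sequences are produced.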
This program converts a file's newlines to UNIX newlines. It reads the file in as a byte array and converts it to a string. After fixing the newlines, it converts the String back into a byte array and writes it to the same location, replacing the original file.

    import java.io.IOException;
    import java.nio.file.FileSystems;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class Program {

        static String convertToUNIXNewlines(String value) {
            return value.replace("\r\n", "\n");
        }

        public static void main(String[] args) throws IOException {
            // Get a Path object.
            Path path = FileSystems.getDefault().getPath("C:\\programs\\file.txt");

            // Read all bytes for the file and convert it to a string.
            // ... This uses the platform's default charset.
            byte[] data = Files.readAllBytes(path);
            String data2 = new String(data);

            // Fix newlines.
            data2 = convertToUNIXNewlines(data2);

            // Write converted bytes (again in the default charset).
            byte[] data3 = data2.getBytes();
            Files.write(path, data3);
        }
    }
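The program above depends on the platform's default charset and on a file that exists at a fixed location. The following sketch makes two assumptions that are not in the original: the file is UTF-8 text, and a temporary file is used so that the example can run anywhere without touching a real file. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

class FileNewlineDemo {

    // Hypothetical helper: rewrite a file in place with UNIX newlines.
    static void convertFileToUnixNewlines(Path path) throws IOException {
        // Assumption: the file contains UTF-8 text.
        String text = new String(Files.readAllBytes(path), StandardCharsets.UTF_8);
        text = text.replace("\r\n", "\n");
        Files.write(path, text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        // Create a temporary file with two Windows newlines and one UNIX newline.
        Path path = Files.createTempFile("newlines", ".txt");
        try {
            Files.write(path, "A\r\nB\nC\r\n".getBytes(StandardCharsets.UTF_8));
            convertFileToUnixNewlines(path);
            // The file shrinks from 8 bytes to 6 bytes.
            System.out.println(Files.size(path)); // 6
        } finally {
            Files.deleteIfExists(path);
        }
    }
}
```

Naming the charset explicitly keeps the byte-to-String round trip the same on every machine. If the file is in another encoding, that charset would need to be used instead.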
Trim
For leading or trailing spaces, the trim() method is ideal. It does not require any special code. This is like a chomp() or chop() method in other languages, although trim() removes whitespace from both ends of the String, not just the end.
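A short sketch of trim() in use, combined with the "\\s+" pattern from the collapse example earlier. The trimAndCollapse helper and the class name are hypothetical names chosen for illustration.

```java
class TrimDemo {

    // Hypothetical helper: strip both ends, then collapse inner whitespace.
    static String trimAndCollapse(String value) {
        // trim() removes leading and trailing chars up to U+0020 (space, tab, \r, \n).
        // replaceAll then turns each remaining run of whitespace into one space.
        return value.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String padded = "  cat \t\n";
        System.out.println("[" + padded.trim() + "]"); // [cat]

        String value = "  Hi,\r\n\t\tA  B C \n";
        System.out.println("[" + trimAndCollapse(value) + "]"); // [Hi, A B C]
    }
}
```

Trimming first means the collapsed result has no leading or trailing space left over.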
In processing text, we remove or combine whitespace. These methods help with that task. Further steps may (for example) remove stopwords.