Date: 2011may30
Update: 2025oct13
Language: Java
Q. Java: Split a String into English words?
A. Use BreakIterator like this:
import java.text.BreakIterator;
import java.util.ArrayList;

class Demo {
    static ArrayList<String> splitIntoWords(final String s) {
        ArrayList<String> out = new ArrayList<>();
        // Uses the default locale's word-boundary rules.
        BreakIterator wordBreaker = BreakIterator.getWordInstance();
        wordBreaker.setText(s);
        int end = 0;
        for (int start = wordBreaker.first(); (end = wordBreaker.next()) != BreakIterator.DONE; start = end) {
            // Each segment between two boundaries counts as a "word",
            // including runs of whitespace and single punctuation marks.
            final String word = s.substring(start, end);
            // A whitespace-only segment trims down to the empty string.
            final String trimmedWord = word.trim();
            out.add(trimmedWord);
        }
        return out;
    }

    public static void main(String[] args) {
        var words = splitIntoWords("hello, world how are you?");
        for (String word : words) {
            System.out.println("word=" + word);
        }
    }
}
Output:
word=hello
word=,
word=
word=world
word=
word=how
word=
word=are
word=
word=you
word=?
As you can see, it returns each space and each punctuation mark as its own "word".
Because of the trim() call, each space shows up in the output as an empty word.
This is actually useful. You can easily discard these if you don't want them.
But they're there if you do want them.
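If you only want the real words, one way to discard the rest is to keep just the segments that contain at least one letter or digit. This is only a sketch: the class name WordsOnly and the method splitIntoWordsOnly are made up for this note, and the letter-or-digit test is my assumption about what counts as a word, so adjust it to taste.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;

class WordsOnly {
    // Hypothetical helper (not part of the JDK): keeps only the segments
    // that contain at least one letter or digit.
    static List<String> splitIntoWordsOnly(final String s) {
        List<String> out = new ArrayList<>();
        BreakIterator wordBreaker = BreakIterator.getWordInstance();
        wordBreaker.setText(s);
        int start = wordBreaker.first();
        for (int end = wordBreaker.next(); end != BreakIterator.DONE; start = end, end = wordBreaker.next()) {
            final String segment = s.substring(start, end);
            // Assumption: whitespace-only and punctuation-only segments are not words.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                out.add(segment);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints: [hello, world, how, are, you]
        System.out.println(splitIntoWordsOnly("hello, world how are you?"));
    }
}
```

No trim() is needed here, since the whitespace segments are dropped entirely instead of being turned into empty strings.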