Break Iterators

Internationalized text-processing applications (such as word processors) need to detect logical boundaries within the text they are manipulating. For example, a word processor needs to detect these boundaries when highlighting a character, selecting a word to cut to the clipboard, moving the caret (text insertion point indicator) to the start of the next sentence, and wrapping a word at the end of a line.

Java provides the Break Iterator API, with its abstract java.text.BreakIterator entry-point class, to detect text boundaries.

BreakIterator declares the following factory methods for obtaining break iterators that detect character, word, sentence, and line boundaries:

■ static BreakIterator getCharacterInstance()

■ static BreakIterator getWordInstance()

■ static BreakIterator getSentenceInstance()

■ static BreakIterator getLineInstance()

Each of these factory methods returns a break iterator for the default locale. If you need a break iterator for a specific locale, you can call the following factory methods:

■ static BreakIterator getCharacterInstance(Locale locale)

■ static BreakIterator getWordInstance(Locale locale)

■ static BreakIterator getSentenceInstance(Locale locale)

■ static BreakIterator getLineInstance(Locale locale)

Each of these factory methods throws NullPointerException when its locale argument is null.
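As a rough sketch of how one of the locale-specific factory methods might be used, the following code obtains a word-based break iterator and collects the words in a sentence. The WordBoundaryDemo class and its words() helper are hypothetical names introduced for illustration. Treating a span as a word when it starts with a letter or digit is a simplifying assumption.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class WordBoundaryDemo
{
   // Hypothetical helper: collects the spans between consecutive word
   // boundaries, keeping only those that start with a letter or digit
   // (an assumed, simplistic test for "is a word").
   static List<String> words(String text, Locale locale)
   {
      BreakIterator bi = BreakIterator.getWordInstance(locale);
      bi.setText(text);
      List<String> result = new ArrayList<>();
      int start = bi.first();
      for (int end = bi.next(); end != BreakIterator.DONE;
           start = end, end = bi.next())
      {
         String span = text.substring(start, end);
         if (Character.isLetterOrDigit(span.codePointAt(0)))
            result.add(span);
      }
      return result;
   }

   public static void main(String[] args)
   {
      System.out.println(words("Break iterators find words.", Locale.US));
   }
}
```

Spaces and punctuation also lie between word boundaries, which is why the sketch filters the spans instead of assuming that every span is a word.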

BreakIterator's locale-sensitive factory methods might not support every locale. For this reason, you should pass to these factory methods only those Locale objects that are stored in the array returned from this class's static Locale[] getAvailableLocales() method (which is also declared in other entry-point classes). This array contains at least Locale.US. Check out Listing 9-15.

Listing 9-15. Obtaining BreakIterator's supported locales and passing the first locale (possibly Locale.US) to getCharacterInstance(Locale)

Locale[] supportedLocales = BreakIterator.getAvailableLocales();
BreakIterator bi = BreakIterator.getCharacterInstance(supportedLocales[0]);
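When you have a particular locale in mind, you might prefer to test for it instead of taking the first array entry. The following sketch shows one possible approach. The SupportedLocaleDemo class and its isSupported() and characterInstanceFor() helpers are hypothetical, and falling back to Locale.US is an assumption made for this example, not a rule imposed by the API.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;

class SupportedLocaleDemo
{
   // Hypothetical helper: true when BreakIterator lists the locale.
   static boolean isSupported(Locale locale)
   {
      return Arrays.asList(BreakIterator.getAvailableLocales())
                   .contains(locale);
   }

   // Hypothetical helper: uses the requested locale when it is listed;
   // otherwise falls back to Locale.US, which is always listed.
   static BreakIterator characterInstanceFor(Locale requested)
   {
      Locale chosen = isSupported(requested) ? requested : Locale.US;
      return BreakIterator.getCharacterInstance(chosen);
   }

   public static void main(String[] args)
   {
      Locale saudiArabia = Locale.forLanguageTag("ar-SA");
      System.out.println("ar-SA supported: " + isSupported(saudiArabia));
      BreakIterator bi = characterInstanceFor(saudiArabia);
      bi.setText("JAVA");
      System.out.println("Last boundary: " + bi.last());
   }
}
```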

A BreakIterator instance has an imaginary cursor that points to the current boundary within a text string. This cursor position can be interrogated and the cursor moved from boundary to boundary with the help of the following BreakIterator methods:

■ abstract int current() returns the text boundary that was most recently returned by next(), next(int), previous(), first(), last(), following(int), or preceding(int). If any of these methods returns BreakIterator.DONE because either the first or the last text boundary has been reached, current() returns the first or last text boundary depending on which one was reached.

■ abstract int first() returns the first text boundary. The iterator's current position is set to this boundary.

■ abstract int following(int offset) returns the first text boundary following the specified character offset. If offset equals the last text boundary, following(int) returns BreakIterator.DONE and the iterator's current position is unchanged. Otherwise, the iterator's current position is set to the returned text boundary. The returned value is always greater than offset or equals BreakIterator.DONE.

■ abstract int last() returns the last text boundary. The iterator's current position is set to this boundary.

■ abstract int next() returns the text boundary following the current boundary. If the current boundary is the last text boundary, next() returns BreakIterator.DONE and the iterator's current position is unchanged. Otherwise, the iterator's current position is set to the boundary following the current boundary.

■ abstract int next(int n) returns the nth text boundary from the current boundary. If either the first or the last text boundary has been reached, next(int) returns BreakIterator.DONE and the current position is set to either the first or last text boundary depending on which one is reached. Otherwise, the iterator's current position is set to the new text boundary.

■ int preceding(int offset) returns the last text boundary preceding the specified character offset. If offset equals the first text boundary, preceding(int) returns BreakIterator.DONE and the iterator's current position is unchanged. Otherwise, the iterator's current position is set to the returned text boundary. The returned value is always less than offset or equals BreakIterator.DONE. (This method was added to BreakIterator in Java version 1.2. It could not be declared abstract because abstract methods cannot be added to existing classes; such methods would also have to be implemented in subclasses that might be inaccessible.)

■ abstract int previous() returns the text boundary preceding the current boundary. If the current boundary is the first text boundary, previous() returns BreakIterator.DONE and the iterator's current position is unchanged. Otherwise, the iterator's current position is set to the boundary preceding the current boundary.
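The cursor methods described in this list can be exercised with a short sketch such as the following, which walks a character-based break iterator over the string JAVA. The CursorDemo class and its walk() method are hypothetical names. The comments show the values you should expect given that JAVA has the boundaries 0 through 4.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;

class CursorDemo
{
   // Hypothetical method: calls each cursor method once and records
   // the value that it returns.
   static int[] walk()
   {
      BreakIterator bi = BreakIterator.getCharacterInstance(Locale.US);
      bi.setText("JAVA");
      return new int[]
      {
         bi.first(),      // 0: first boundary
         bi.next(),       // 1: boundary after 0
         bi.next(2),      // 3: second boundary after 1
         bi.previous(),   // 2: boundary before 3
         bi.following(2), // 3: first boundary after offset 2
         bi.preceding(2), // 1: last boundary before offset 2
         bi.last(),       // 4: last boundary (the string's length)
         bi.next(),       // DONE: no boundary follows the last one
         bi.current()     // 4: position unchanged by the failed next()
      };
   }

   public static void main(String[] args)
   {
      System.out.println(Arrays.toString(walk()));
   }
}
```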

Figure 9-4 reveals that characters are located between boundaries, boundaries are zero-based, and the last boundary is the length of the string.

Figure 9-4. JAVA's character boundaries as reported by the next() and previous() methods

BreakIterator also declares a void setText(String newText) method that identifies newText as the text to be iterated over. This method resets the cursor position to the beginning of this string.
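Because setText() resets the cursor, a single break iterator can be reused for several strings. The following sketch illustrates this behavior; the SetTextDemo class and its positions() method are hypothetical names chosen for this example.

```java
import java.text.BreakIterator;
import java.util.Arrays;
import java.util.Locale;

class SetTextDemo
{
   // Hypothetical method: moves the cursor to the end of one string,
   // supplies a second string, and reports where the cursor ends up.
   static int[] positions()
   {
      BreakIterator bi = BreakIterator.getCharacterInstance(Locale.US);
      bi.setText("JAVA");
      int endOfFirst = bi.last();     // cursor moves to boundary 4
      bi.setText("Hi");               // new text; cursor is reset
      int afterReset = bi.current();  // cursor is back at boundary 0
      int endOfSecond = bi.last();    // boundary 2, the length of "Hi"
      return new int[] { endOfFirst, afterReset, endOfSecond };
   }

   public static void main(String[] args)
   {
      System.out.println(Arrays.toString(positions()));
   }
}
```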

Listing 9-16 shows you how to use a character-based break iterator to iterate over a string's characters in a locale-independent manner.

Listing 9-16. Iterating over English/US and Arabic/Saudi Arabia strings

import java.text.BreakIterator;

import java.util.Locale;

public class BreakIteratorDemo
{
   public static void main(String[] args)
   {
      // Iterate over the character boundaries of an English word.
      BreakIterator bi = BreakIterator.getCharacterInstance(Locale.US);
      bi.setText("JAVA");
      dumpPositions(bi);
      // Iterate over the character boundaries of an Arabic word.
      bi = BreakIterator.getCharacterInstance(new Locale("ar", "SA"));
      bi.setText("\u0631\u0641\u0651");
      dumpPositions(bi);
   }

   static void dumpPositions(BreakIterator bi)
   {
      // Start at the first boundary and output each boundary in turn.
      int boundary = bi.first();
      while (boundary != BreakIterator.DONE)
      {
         System.out.print(boundary + " ");
         boundary = bi.next();
      }
      System.out.println();
   }
}

The main() method first obtains a character-based break iterator for the United States locale. main() then calls the iterator's setText() method to specify JAVA as the text to be iterated over.

Iteration occurs in the dumpPositions() method. After calling first() to obtain the first boundary, this method uses a while loop to output the boundary and move to the next boundary (via next()) while the current boundary does not equal BreakIterator.DONE.

Because character iteration is straightforward for English words, main() next obtains a character-based break iterator for the Saudi Arabia locale, and uses this iterator to iterate over the characters in Figure 9-5's Arabic version of "shelf" (as in shelf of books).

Figure 9-5. The letters and diacritic (shadda) making up the Arabic equivalent of "shelf" are written from right to left.

In Arabic, the word "shelf" consists of the letters resh and pe (named REH and FEH in Unicode) and the diacritic shadda. A diacritic is an ancillary glyph (a mark on paper or another writing medium) that is added to a letter, or basic glyph. Shadda, which is shaped like a small handwritten Latin w, indicates gemination (consonant doubling or extra length). Gemination is phonemic in Arabic; that is, it can distinguish one word from another. Shadda is written above the consonant that is to be doubled, which is pe in this example.

When you run this application, it generates the following output:

0 1 2 3 4

0 1 3

The first output line reveals Figure 9-4's character boundaries for the word JAVA. The second output line (0 comes before resh, 1 comes before pe) implies that you cannot move an Arabic word processor's caret on the screen once for every Unicode character. Instead, it is moved once for every user character, a logical character that can be composed of multiple Unicode characters, such as pe (\u0641) and shadda (\u0651).
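The difference between Unicode characters and user characters can be made concrete by counting both. The following sketch is one way to do so; the UserCharacterDemo class and its countUserCharacters() helper are hypothetical names, and the helper simply counts the spans between consecutive character boundaries.

```java
import java.text.BreakIterator;
import java.util.Locale;

class UserCharacterDemo
{
   // Hypothetical helper: counts user characters by counting how many
   // times next() advances to another boundary.
   static int countUserCharacters(String text, Locale locale)
   {
      BreakIterator bi = BreakIterator.getCharacterInstance(locale);
      bi.setText(text);
      int count = 0;
      bi.first();
      while (bi.next() != BreakIterator.DONE)
         count++;
      return count;
   }

   public static void main(String[] args)
   {
      String shelf = "\u0631\u0641\u0651";
      Locale saudiArabia = Locale.forLanguageTag("ar-SA");
      // Three Unicode characters, but only two user characters because
      // shadda combines with the letter that precedes it.
      System.out.println("Unicode characters: " + shelf.length());
      System.out.println("User characters: " +
                         countUserCharacters(shelf, saudiArabia));
   }
}
```

A word processor would typically rely on a count of this kind, not on String's length() method, when deciding how many times the caret can move across a word.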

NOTE: For examples of break iterators that iterate over words, sentences, and lines, check out the "Detecting Text Boundaries" section (http://download.oracle.com/docs/cd/E17409_01/javase/tutorial/i18n/text/boundaryintro.html) in The Java Tutorials' Internationalization trail.
