Code: The Nitty-Gritty Details

Here's another chicken-and-egg situation: we only want to get to know the Android APIs relevant for game programming. But we still don't know how to actually program a game. We have an idea of how to design one, but transforming it into an executable is still voodoo magic to us. In the following subsections, I want to give you an overview of what a game is usually composed of. We'll look at some pseudocode for interfaces we'll later implement with what Android offers us. Interfaces are awesome for two reasons: they allow us to concentrate on the semantics without needing to know the implementation details, and they allow us to later exchange the implementation (e.g., instead of using 2D CPU rendering, we could exploit OpenGL ES to display Mr. Nom on the screen).

Every game needs some basic framework that abstracts away and eases the pain of communicating with the underlying operating system. Usually this is split up into modules, as follows:

Window management: This is responsible for creating a window and coping with things like closing the window or pausing/resuming the application on Android.

Input: This is related to the window management module, and keeps track of user input (e.g., touch events, keystrokes, and accelerometer readings).

File I/O: This allows us to get the bytes of our assets into our program from disk.

Graphics: This is probably the most complex module besides the actual game. It is responsible for loading graphics and drawing them on the screen.

Audio: This module is responsible for loading and playing everything that will hit our ears.

Game framework: This ties all the above together and provides an easy-to-use base to write our games.

Each of these modules is composed of one or more interfaces. Each interface will have at least one concrete implementation that implements the semantics of the interface based on what the underlying platform (in our case Android) provides us with.

NOTE: Yes, I deliberately left out networking from the preceding list. We will not implement multiplayer games in this book, I'm afraid. That is a rather advanced topic depending on the type of game. If you are interested in this topic, you can find a range of tutorials on the Web. (www.gamedev.net is a good place to start).

In the following discussion we will be as platform agnostic as possible. The concepts are the same on all platforms.

Application and Window Management

A game is just like any other computer program that has a UI. It is contained in some sort of window (if the underlying operating system's UI paradigm is window based, which is the case on all mainstream operating systems). The window serves as a container, and we basically think of it as a canvas that we draw our game content on.

Most operating systems allow the user to interact with the window in a special way besides touching the client area or pressing a key. On desktop systems you can usually drag the window around, resize it or minimize it to some sort of taskbar. On Android, resizing is replaced with accommodating an orientation change, and minimizing is similar to putting the application in the background via a press of the home button or as a reaction to an incoming call.

The application and window management module is also responsible for actually setting up the window and making sure it is filled by a single UI component that we can later render to and that receives input from the user in the form of touching or pressing keys. That UI component might be rendered to via the CPU, or it can be hardware accelerated, as is the case with OpenGL ES.

The application and window management module does not have a concrete set of interfaces. We'll merge it with the game framework later on. What we have to remember are the application states and window events that we have to manage (a small pseudocode sketch follows this list):

Create: Called once when the window (and thus the application) is started up.

Pause: Called when the application is paused by some mechanism.

Resume: Called when the application is resumed and the window is in the foreground again.
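Expressed as pseudocode, these three states could be handled by a listener along the following lines. This is purely illustrative; as mentioned, we won't define a separate interface for this module but will fold the state handling into the game framework later, and all names here are made up:

public interface WindowStateListener {
    public void onCreate();   // the window and application are starting up
    public void onPause();    // the application was paused (home button, incoming call, and so on)
    public void onResume();   // the application is back in the foreground
}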

NOTE: Some Android aficionados might roll their eyes at this point. Why only use a single window (activity in Android speak)? Why not use more than one UI widget for the game—say, for implementing complex UIs that our game might need? The main reason is that we want complete control over the look and feel of our game. It also allows me to focus on Android game programming instead of Android UI programming, a topic for which better books exist—for example, Mark Murphy's excellent Beginning Android 2 (Apress, 2010).

Input

The user will surely want to interact with our game in some way. That's where the input module comes in. On most operating systems, input events such as touching the screen or pressing a key are dispatched to the currently focused window. The window will then further dispatch the event to the UI component that has the focus. The dispatching process is usually transparent to us; all we need to care about is getting the events from the focused UI component. The UI APIs of the operating system provide a mechanism to hook into the event dispatching system so we can easily register and record the events. This hooking into and recording of events is the main task of the input module.

What can we do with the recorded information? There are two modi operandi:

Polling: With polling, we only check the current state of the input devices. Any states between the current check and the last check will be lost. This way of input handling is suitable for checking things like whether a user touches a specific button, for example. It is not suitable for tracking text input, as the order of key events is lost.

Event-based handling: This gives us a full chronological history of the events that have occurred since we last checked. It is a suitable mechanism to perform text input or any other task that relies on the order of events. It's also useful to detect when a finger first touched the screen or when it was lifted.

What input devices do we want to handle? On Android, we have three main input methods: touchscreen, keyboard/trackball, and accelerometer. The first two are suitable for both polling and event-based handling. The accelerometer is usually just polled. The touchscreen can generate three events:

Touch down: This happens when a finger touches the screen.

Touch drag: This happens when a finger is dragged across the screen. Before a drag there's always a down event.

Touch up: This happens when a finger is lifted from the screen.

Each touch event has additional information: the position relative to the UI component's origin, and a pointer index used in multitouch environments to identify and track separate fingers.

The keyboard can generate two types of events:

Key down: This happens when a key is pressed down.

Key up: This happens when a key is lifted. This event is always preceded by a key-down event.

Key events also carry additional information. Key-down events store the pressed key's code. Key-up events store the key's code and an actual Unicode character. There's a difference between a key's code and the Unicode character generated by a key-up event. In the latter case, the state of other keys, such as the Shift key, is also taken into account. This way, we can get upper- and lowercase letters in a key-up event, for example. With a key-down event, we only know that a certain key was pressed; we have no information on what character that keypress would actually generate.

Finally, there's the accelerometer. We will always poll the accelerometer's state. The accelerometer reports the acceleration exerted by the gravity of our planet on one of three axes of the accelerometer. The axes are called x, y, and z. Figure 3-19 depicts each axis's orientation. The acceleration on each axis is expressed in meters per second squared (m/s²). From our physics class, we know that an object will accelerate at roughly 9.8 m/s² when in free fall on planet Earth. Other planets have a different gravity, so the acceleration constant is also different. For the sake of simplicity, we'll only deal with planet Earth here. When an axis points away from the center of the Earth, the maximum acceleration is applied to it. If an axis points toward the center of the Earth, we get a negative maximum acceleration. If you hold your phone upright in portrait mode, then the y-axis will report an acceleration of 9.8 m/s², for example. In Figure 3-19, the z-axis would report an acceleration of 9.8 m/s², and the x- and y-axes would report an acceleration of zero.

Figure 3-19. The accelerometer axes on an Android phone. The z-axis points out of the phone.

Now let's define an interface that gives us polling access to the touchscreen, the keyboard, and the accelerometer, and gives us event-based access to the touchscreen and keyboard (see Listing 3-1).

Listing 3-1. The Input Interface and the KeyEvent and TouchEvent Classes

package com.badlogic.androidgames.framework;

import java.util.List;

public interface Input {
    public static class KeyEvent {
        public static final int KEY_DOWN = 0;
        public static final int KEY_UP = 1;

        public int type;
        public int keyCode;
        public char keyChar;
    }

    public static class TouchEvent {
        public static final int TOUCH_DOWN = 0;
        public static final int TOUCH_UP = 1;
        public static final int TOUCH_DRAGGED = 2;

        public int type;
        public int x, y;
        public int pointer;
    }

    public boolean isKeyPressed(int keyCode);
    public boolean isTouchDown(int pointer);
    public int getTouchX(int pointer);
    public int getTouchY(int pointer);
    public float getAccelX();
    public float getAccelY();
    public float getAccelZ();
    public List<KeyEvent> getKeyEvents();
    public List<TouchEvent> getTouchEvents();
}

Our definition starts off with two classes, KeyEvent and TouchEvent. The KeyEvent class defines constants that encode a KeyEvent's type; the TouchEvent class does the same. A KeyEvent instance records its type, the key's code, and its Unicode character in case the event's type is KEY_UP.

The TouchEvent code is similar, and holds the TouchEvent's type, the position of the finger relative to the UI component's origin, and the pointer ID that was given to the finger by the touchscreen driver. The pointer ID for a finger will stay the same for as long as that finger is on the screen. The first finger that goes down gets the pointer ID 0, the next the ID 1, and so on. If two fingers are down and finger 0 is lifted, then finger 1 keeps its ID for as long as it is touching the screen. A new finger will get the first free ID, which would be 0 in this example.

Next are the polling methods of the Input interface, which should be pretty self-explanatory. Input.isKeyPressed() takes a keyCode and returns whether the corresponding key is currently pressed or not. Input.isTouchDown(), Input.getTouchX(), and Input.getTouchY() return whether a given pointer is down, as well as its current x- and y-coordinates. Note that the coordinates will be undefined if the corresponding pointer is not actually touching the screen.

Input.getAccelX(), Input.getAccelY(), and Input.getAccelZ() return the respective acceleration values of each accelerometer axis.

The last two methods are used for event-based handling. They return the KeyEvent and TouchEvent instances that got recorded since the last time we called these methods. The events are ordered according to when they occurred, with the newest event being at the end of the list.
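To give you a feel for how the interface might be used, here is a hypothetical snippet that checks the first pointer via polling and then walks through the recorded touch events. The input variable is assumed to be an instance of some Input implementation:

// somewhere in our game loop; 'input' is some Input implementation
if (input.isTouchDown(0)) {
    int x = input.getTouchX(0);
    int y = input.getTouchY(0);
    // the first finger is currently down at (x,y); check whether it hits a button, and so on
}

List<TouchEvent> touchEvents = input.getTouchEvents();
for (int i = 0; i < touchEvents.size(); i++) {
    TouchEvent event = touchEvents.get(i);
    if (event.type == TouchEvent.TOUCH_UP) {
        // the finger with ID event.pointer was lifted at (event.x, event.y)
    }
}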

With this simple interface and these helper classes, we have all our input needs covered. Let's move on to handling files.

NOTE: While mutable classes with public members are an abomination, we can get away with them in this case for two reasons: Dalvik is still slow when calling methods (getters in this case), and the mutability of the event classes does not have an impact on the inner workings of an Input implementation. Just note that this is bad style in general, but we will resort to this shortcut every once in a while for performance reasons.

File I/O

Reading and writing files is quite essential for our game development endeavor. Given that we are in Java land, we are mostly concerned with creating InputStream and OutputStream instances, the standard Java mechanisms for reading and writing data from and to a specific file. In our case, we are mostly concerned with reading files that we package with our game, such as level files, images, and audio files. Writing files is something we'll do a lot less often. Usually we only write files if we want to persist high-scores or game settings, or save a game state so users can pick up from where they left off.

We want the easiest possible file-accessing mechanism; Listing 3-2 shows my proposal for a simple interface.

Listing 3-2. The FileIO Interface

package com.badlogic.androidgames.framework;

import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public interface FileIO {
    public InputStream readAsset(String fileName) throws IOException;

    public InputStream readFile(String fileName) throws IOException;

    public OutputStream writeFile(String fileName) throws IOException;
}

That's rather lean and mean. We just specify a filename and get a stream in return. As usual in Java, we will throw an IOException in case something goes wrong. Where we read and write files from and to is dependent on the implementation, of course. Assets will be read from our application's APK file, and files will be read from and written to on the SD card (also known as external storage).

The returned InputStreams and OutputStreams are plain-old Java streams. Of course, we have to close them once we are finished using them.
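As a quick illustration, here's how reading an asset might look with this interface. The file name is made up, and error handling is kept to a minimum:

FileIO fileIO = ...;  // some FileIO implementation
InputStream in = null;
try {
    in = fileIO.readAsset("level1.txt");  // hypothetical asset name
    // read and process the level data from the stream here
} catch (IOException e) {
    // the asset is missing or corrupt; react accordingly
} finally {
    if (in != null) {
        try { in.close(); } catch (IOException e) { }
    }
}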

Audio

While audio programming is a rather complex topic, we can get away with a very simple abstraction. We will not do any advanced audio processing; we'll just play back sound effects and music that we load from files, much like we'll load bitmaps in the graphics module.

Before we dive into our module interfaces, though, let's stop for a moment and get some idea what sound actually is and how it is represented digitally.

The Physics of Sound

Sound is usually modeled as a set of waves that travel in a medium such as air or water. The wave is not an actual physical object, but rather the movement of the molecules within the medium. Think of a little pond into which you throw a stone. When the stone hits the pond's surface, it will push away a lot of water molecules within the pond, and eventually those pushed-away molecules will transfer their energy to their neighbors, which will start to move and push as well. Eventually you will see circular waves emerge from where the stone hit the pond. Something similar happens when sound is created. Instead of a circular movement, you get spherical movement, though. As you may know from the highly scientific experiments you may have carried out in your childhood, water waves can interact with each other; they can cancel each other out or reinforce each other. The same is true for sound waves. All sound waves in an environment combine to form the tones and melodies you hear when you listen to music. The volume of a sound is dictated by how much energy the moving and pushing molecules exert on their neighbors and eventually on your ear.

Recording and Playback

The principle of recording and playing back audio is actually pretty simple in theory: for recording, we keep track of how much pressure the molecules that form the sound waves exerted on an area in space at each point in time. Playing back this data is a mere matter of getting the air molecules surrounding the speaker to swing and move like they did when we recorded them.

In practice, it is of course a little more complex. Audio is usually recorded in one of two ways: in analog or digitally. In both cases, the sound waves are recorded with some sort of microphone, which usually consists of a membrane that translates the pushing from the molecules to some sort of signal. How this signal is processed and stored is what makes the difference between analog and digital recording. We are working digitally, so let's just have a look at that case.

Recording audio digitally means that the state of the microphone membrane is measured and stored at discrete time steps. Depending on the pushing by the surrounding molecules, the membrane can be pushed inward or outward with regard to a neutral state. This process is called sampling, as we take membrane state samples at discrete points in time. The number of samples we take per time unit is called the sampling rate. Usually the time unit is given in seconds, and the unit is called Hertz (Hz). The more samples per second, the higher the quality of the audio. CDs play back at a sampling rate of 44,100 Hz, or 44.1 KHz. Lower sampling rates are found, for example, when transferring voice over the telephone line (8 KHz is common in this case).

The sampling rate is only one attribute responsible for a recording's quality. The way we store each membrane state sample also plays a role, and is also subject to digitalization. Let's recall what the membrane state actually is: it's the distance of the membrane from its neutral state. As it makes a difference whether the membrane is pushed inward or outward, we record the signed distance. Hence, the membrane state at a specific time step is a single negative or positive number. We can store such a signed number in a variety of ways: as a signed 8-, 16-, or 32-bit integer, as a 32-bit float, or even as a 64-bit float. Every data type has limited precision. An 8-bit signed integer can store 127 positive and 128 negative distance values. A 32-bit integer provides a lot more resolution. When stored as a float, the membrane state is usually normalized to a range between -1 and 1. The maximum positive and minimum negative values represent the farthest distance the membrane can have from its neutral state. The membrane state is also called the amplitude. It represents the loudness of the sound hitting the membrane.
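To make sampling a little more tangible, here's a small sketch that samples one second of a 440-Hz sine tone at 44,100 Hz and stores each amplitude as a signed 16-bit value. The numbers and variable names are just illustrative:

int samplingRate = 44100;                    // samples per second (Hz)
float frequency = 440;                       // tone frequency in Hz
short[] samples = new short[samplingRate];   // one second of mono audio
for (int i = 0; i < samples.length; i++) {
    double time = i / (double) samplingRate;
    double amplitude = Math.sin(2 * Math.PI * frequency * time); // between -1 and 1
    samples[i] = (short) (amplitude * Short.MAX_VALUE);          // scale to the 16-bit range
}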

With a single microphone we can only record mono sound, which loses all spatial information. With two microphones, we can measure sound at different locations in space, and thus get so-called stereo sound. You might achieve stereo sound, for example, by placing one microphone to the left and another to the right of an object emitting sound. When the sound is played back simultaneously through two speakers, we can sort of reproduce the spatial component of the audio. But this also means that we need to store twice the number of samples when storing stereo audio.

The playback is a simple matter in the end. Once we have our audio samples in digital form, with a specific sampling rate and data type, we can throw that data at our audio processing unit, which will transform the information into a signal for an attached speaker. The speaker interprets this signal and translates it into the vibration of a membrane, which in turn will cause the surrounding air molecules to move and produce sound waves. It's exactly what is done for recording, only reversed!

Audio Quality and Compression

Wow, lots of theory. Why do we care? If you paid attention, you can now tell whether an audio file has a high quality or not depending on the sampling rate and the data type used to store each sample. The higher the sampling rate and the higher the data type precision, the better the quality of the audio. However, that also means that we need more storage room for our audio signal.

Imagine we record the same sound with a length of 60 seconds twice: once at a sampling rate of 8 KHz at 8 bits per sample, and once at a sampling rate of 44 KHz at 16-bit precision. How much memory would we need to store each sound? In the first case, we need 1 byte per sample. Multiply this by the sampling rate of 8,000 Hz, and we need 8,000 bytes per second. For our full 60 seconds of audio recording, that's 480,000 bytes, or roughly half a megabyte (MB). Our higher-quality recording needs quite a bit more memory: 2 bytes per sample, and 2 times 44,000 bytes per second. That's 88,000 bytes per second. Multiply this by 60 seconds, and we arrive at 5,280,000 bytes, or a little over 5 MB. Your usual 3-minute pop song would take up over 15 MB at that quality, and that's only a mono recording. For a stereo recording, we'd need twice that amount of memory. Quite a lot of bytes for a silly song!
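The same arithmetic expressed in a few lines of code, in case you want to plug in other sampling rates or durations:

int samplingRate = 44000;    // 44 KHz, as in the example above
int bytesPerSample = 2;      // 16-bit samples
int channels = 1;            // mono
int seconds = 60;
int totalBytes = samplingRate * bytesPerSample * channels * seconds;
// 5,280,000 bytes, a little over 5 MB; double it for stereo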

Many smart people have come up with ways to reduce the number of bytes needed for an audio recording. They've invented rather complex psychoacoustic compression algorithms that analyze an uncompressed audio recording and output a smaller, compressed version. The compression is usually lossy, meaning that some minor parts of the original audio are omitted. When you play back MP3s or OGGs, you are actually listening to compressed lossy audio. So, using formats such as MP3 or OGG will help us reduce the amount of space needed to store our audio on disk.

What about playing back the audio from compressed files? While there exists dedicated decoding hardware for various compressed audio formats, common audio hardware can often only cope with uncompressed samples. Before actually feeding the audio card with samples, we have to first read them in and decompress them. We can do this once and store all the uncompressed audio samples in memory, or we can stream in portions of the audio file as needed.

In Practice

You have seen that even 3-minute songs can take up a lot of memory. When we play back our game's music, we will thus stream the audio samples in on the fly instead of preloading all audio samples to memory. Usually, we only have a single music stream playing, so we only have to access the disk once.

For short sound effects, such as explosions or gunshots, the situation is a little different. We often want to play a sound effect multiple times simultaneously. Streaming the audio samples from disk for each instance of the sound effect is not a good idea. We are lucky, though, as short sounds do not take up a lot of memory. We will therefore read in all samples of a sound effect to memory, from where we can directly and simultaneously play them back.

So, we have the following requirements:

■ We need a way to load audio files for streaming playback and for playback from memory.

■ We need a way to control the playback of streamed audio.

■ We need a way to control the playback of fully loaded audio.

This directly translates into the Audio, Music, and Sound interfaces (shown in Listings 3-3 through 3-5, respectively).

Listing 3-3. The Audio Interface

package com.badlogic.androidgames.framework;

public interface Audio {
    public Music newMusic(String filename);

    public Sound newSound(String filename);
}

The Audio interface is our way to create new Music and Sound instances. A Music instance represents a streamed audio file. A Sound instance represents a short sound effect that we keep entirely in memory. The methods Audio.newMusic() and Audio.newSound() both take a filename as an argument and throw an IOException in case the loading process fails (e.g., when the specified file does not exist or is corrupt). The filenames refer to asset files in our application's APK file.

Listing 3-4. The Music Interface

package com.badlogic.androidgames.framework;

public interface Music {
    public void play();
    public void stop();
    public void pause();
    public void setLooping(boolean looping);
    public void setVolume(float volume);
    public boolean isPlaying();
    public boolean isStopped();
    public boolean isLooping();
    public void dispose();
}

The Music interface is a little more involved. It features methods to start playing the music stream, to pause and stop it, and to set it to loop playback, which means it will automatically start from the beginning when it reaches the end of the audio file. Additionally, we can set the volume as a float in the range of 0 (silent) to 1 (maximum volume). There are also a couple of getter methods that allow us to poll the current state of the Music instance. Once we no longer need a Music instance, we have to dispose of it. This will close any system resources, such as the file the audio is streamed from.

Listing 3-5. The Sound Interface

package com.badlogic.androidgames.framework;

public interface Sound {
    public void play(float volume);

    public void dispose();
}

The Sound interface is simpler. All we need to do is call its play() method, which again takes a float parameter to specify the volume. We can call the play() method anytime we want (e.g., when a shot is fired or a player jumps). Once we no longer need the Sound instance, we have to dispose of it to free up the memory that the samples use, as well as any other system resources associated with it.
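Here's a small hypothetical example of how the three interfaces might play together; the file names are made up:

Audio audio = ...;                                // some Audio implementation
Music music = audio.newMusic("music/menu.ogg");   // streamed while playing
music.setLooping(true);
music.setVolume(0.5f);
music.play();

Sound shot = audio.newSound("sounds/shot.ogg");   // fully loaded into memory
shot.play(1.0f);                                  // play at full volume whenever a shot is fired

// once we are done with them
music.dispose();
shot.dispose();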

NOTE: While we covered a lot of ground in this chapter, there's a lot more to learn about audio programming. I simplified some things to keep this section short and sweet. Usually you wouldn't specify the audio volume linearly, for example. In our context, it's OK to overlook this little detail. Just be aware that there's more to it!

Graphics

The last module close to the metal is the graphics module. As you might have guessed, it will be responsible for drawing images (also known as bitmaps) to our screen. That may sound easy, but if you want high-performance graphics, you have to know at least the basics of graphics programming. Let's start with the basics of 2D graphics.

The first question we need to ask goes like this: how on Earth are the images output to my display? The answer is rather involved, and we do not necessarily need to know all the details. We'll just quickly review what's happening inside our computer and the display.

Of Rasters, Pixels, and Framebuffers

Today's displays are raster based. A raster is a two-dimensional grid of so-called picture elements. You might know them as pixels, and we'll refer to them as such in the subsequent text. The raster grid has a limited width and height, which we usually express as the number of pixels per row and per column. If you feel brave, you can turn on your computer and try to make out individual pixels on your display. Note that I'm not responsible for any damage that does to your eyes, though.

A pixel has two attributes: a position within the grid and a color. A pixel's position is given as two-dimensional coordinates within a discrete coordinate system. Discrete means that a coordinate is always at an integer position. Coordinates are defined within a Euclidean coordinate system imposed on the grid. The origin of the coordinate system is the top-left corner of the grid. The positive x-axis points to the right and the y-axis points downward. The last item is what confuses people the most. We'll come back to it in a minute; there's a simple reason why this is the case.

Ignoring the silly y-axis, we can see that due to the discrete nature of our coordinates, the origin is coincident with the top-left pixel in the grid, which is located at (0,0). The pixel to the right of the origin pixel is located at (1,0), the pixel beneath the origin pixel is at (0,1), and so on (see the left side of Figure 3-20). The display's raster grid is finite, so there's a limited number of meaningful coordinates. Negative coordinates are outside the screen. Coordinates greater than or equal to the width or height of the raster are also outside the screen. Note that the biggest x-coordinate is the raster's width minus 1, and the biggest y-coordinate is the raster's height minus 1. That's due to the origin being coincident with the top-left pixel. Off-by-one errors are a common source of frustration in graphics programming.

The display receives a constant stream of information from the graphics processor. It encodes the color of each pixel in the display's raster as specified by the program or operating system in control of drawing to the screen. The display will refresh its state a few dozen times per second. The exact rate is called the refresh rate, and it is expressed in Hertz. Liquid crystal displays (LCDs) usually have a refresh rate of 60 Hz; cathode ray tube (CRT) monitors and plasma monitors often have higher refresh rates.

The graphics processor has access to a special memory area known as video memory, or VRAM. Within VRAM there's a reserved area for storing each pixel to be displayed on the screen. This area is usually called the framebuffer. A complete screen image is therefore called a frame. For each pixel in the display's raster grid, there's a corresponding memory address in the framebuffer that holds the pixel's color. When we want to change what's displayed on the screen, we simply change the color values of the pixels in that memory area in VRAM.

Figure 3-20. Display raster grid and VRAM, oversimplified

Time to explain why the y-axis in the display's coordinate system is pointing downward. Memory, be it VRAM or normal RAM, is linear and one dimensional. Think of it as a one-dimensional array. So how do we map the two-dimensional pixel coordinates to one-dimensional memory addresses? Figure 3-20 shows a rather small display raster grid of three-by-two pixels, as well as its representation in VRAM (we assume VRAM only consists of the framebuffer memory). From this we can easily derive the following formula to calculate the memory address of a pixel at (x,y):

int address = x + y * rasterWidth;

We can also go the other way around, from an address to the x- and y-coordinates of a pixel:

int x = address % rasterWidth;
int y = address / rasterWidth;

So, the y-axis is pointing downward because of the memory layout of the pixel colors in VRAM. This is actually a sort of legacy inherited from the early days of computer graphics. Monitors would update the color of each pixel on the screen starting at the top-left corner moving to the right, tracing back to the left on the next line, until they reached the bottom of the screen. It was convenient to have the VRAM contents laid out in a manner that eased the transfer of the color information to the monitor.
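Here's a tiny sketch of what a pixel-plotting routine based on this address calculation could look like, assuming we had a linear array of pixel colors to write to (which, as the following note explains, we usually don't get directly):

// 'vram' is a hypothetical linear array holding one color value per pixel
void setPixel(int[] vram, int rasterWidth, int x, int y, int color) {
    vram[x + y * rasterWidth] = color;
}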

NOTE: If we had full access to the framebuffer, we could use the preceding equation to write a full-fledged graphics library to draw pixels, lines, rectangles, images loaded to memory, and so on. Modern operating systems do not grant us direct access to the framebuffer for various reasons. Instead we usually draw to a memory area that is then copied to the actual framebuffer by the operating system. The general concepts hold true in this case as well, though! If you are interested in how to do these low-level things efficiently, search the Web for a guy called Bresenham and his line- and circle-drawing algorithms.

Vsync and Double-Buffering

Now, if you remember the paragraph about refresh rates, you might have noticed that those rates seem rather low, and that we might be able to write to the framebuffer faster than the display will refresh. That can happen. Even worse, we don't know when the display is grabbing its latest frame copy from VRAM, which could be a problem if we're in the middle of drawing something. In this case, the display will then show parts of the old framebuffer content and parts of the new state—an undesirable situation. You can see that effect in many PC games, where it expresses itself as tearing (in which the screen shows parts of the last frame and parts of the new frame simultaneously).

The first part of the solution to this problem is called double-buffering. Instead of having a single framebuffer, the graphics processing unit (GPU) actually manages two of them, a front buffer and a back buffer. The front buffer is available to the display to fetch the pixel colors from, and the back buffer is available to draw our next frame while the display happily feeds off the front buffer. When we finish drawing our current frame, we tell the GPU to switch the two buffers with each other, which usually means just swapping the address of the front and the back buffer. In graphics programming literature and API documentation, you may find the terms page flip and buffer swap, which refer to this process.

Double-buffering alone does not solve the problem entirely, though: the swap can still happen while the screen is in the middle of refreshing its content. That's where vertical synchronization (also known as vsync) comes into play. When we call the buffer swap method, the GPU will block until the display signals that it has finished its current refresh. Once that happens, the GPU can safely swap the buffer addresses, and all will be well.
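Put as pseudocode, a double-buffered rendering loop boils down to something like the following; the method names are made up, and the actual calls depend on the API we end up using:

while (gameIsRunning) {
    clear(backBuffer);           // start each frame with a clean back buffer
    drawGameFrame(backBuffer);   // render the complete frame off-screen
    swapBuffers();               // with vsync, blocks until the display finished refreshing,
                                 // then the front and back buffers switch roles
}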

Luckily, we barely need to care about those pesky details nowadays. VRAM and the details of double-buffering and vsyncing are securely hidden from us so we cannot wreak havoc with them. Instead we are provided with a set of APIs that usually limit us to manipulating the contents of our application window. Some of these APIs, such as OpenGL ES, expose hardware acceleration, which basically does nothing more than manipulate VRAM with specialized circuits on the graphics chip. See, it's not magic! The reason you should be aware of the inner workings, at least at a high level, is that it allows you to understand the performance characteristics of your application. When vsync is enabled, you can never go above the refresh rate of your screen, which might be puzzling if all you're doing is drawing a single pixel.

When we render with non-hardware-accelerated APIs, we don't directly deal with the display itself. Instead we draw to one of the UI components in our window. In our case we deal with a single UI component that is stretched over the whole window. Our coordinate system will therefore not stretch over the entire screen, but only our UI component. The UI component effectively becomes our display, with its own virtual framebuffer. The operating system will then manage compositing the contents of all the visible windows and make sure their contents are correctly transferred to the regions they cover in the real framebuffer.

What Is Color?

You will notice that I have conveniently ignored colors so far. I made up a type called color in Figure 3-20 and pretended all is well. Let's see what color really is.

Physically, color is the reaction of your retina and visual cortex to electromagnetic waves. Such a wave is characterized by its wavelength and its intensity. We can see waves with a wavelength between roughly 400 and 700 nm. That subband of the electromagnetic spectrum is also known as the visible light spectrum. A rainbow shows all the colors of this visible light spectrum, going from violet to blue to green to yellow, followed by orange and ending at red. All a monitor does is emit specific electromagnetic waves for each pixel, which we experience as the color of each pixel. Different types of displays use different methods to achieve that goal. A simplified version of this process goes like this: every pixel on the screen is made up of three different fluorescent particles that will emit light with one of the colors red, green, or blue. When the display refreshes, each pixel's fluorescent particles will emit light by some means (e.g., in the case of CRT displays, the pixel's particles get hit by a bunch of electrons). For each particle, the display can control how much light it emits. For example, if a pixel is entirely red, only the red particle will be hit with electrons at full intensity. If we want colors other than the three base colors, we can achieve that by mixing the base colors. Mixing is done by varying the intensity with which each particle emits its color. The electromagnetic waves will overlay each other on the way to our retina. Our brain interprets this mix as a specific color. A color can thus be specified by a mix of intensities of the base colors red, green, and blue.

Color Models

What we just discussed is called a color model, specifically the RGB color model. RGB stands for red, green, and blue, of course. There are many more color models we could use, such as YUV and CMYK. In most graphics programming APIs, the RGB color model is pretty much the standard, though, so we'll only discuss that here.

The RGB color model is called an additive color model, due to the fact that the final color is derived via mixing the additive primary colors red, green, and blue. You've probably experimented with mixing primary colors in school. Figure 3-21 shows you some examples for RGB color mixing to refresh your memory a little bit.

Figure 3-21. Having fun with mixing the primary colors red, green, and blue

We can of course generate a lot more colors than the ones shown in Figure 3-21 by varying the intensity of the red, green, and blue components. Each component can have an intensity value between 0 and some maximum value (say, 1). If we interpret each color component as a value on one of the three axes of a three-dimensional Euclidean space, we can plot a so-called color cube, as depicted in Figure 3-22. A color is given as a triplet (red, green, blue), where each component is in the range between 0.0 and 1.0. 0.0 means no intensity for that color, and 1.0 means full intensity. The color black is at the origin (0,0,0), and the color white is at (1,1,1).

Figure 3-22. The mighty RGB color cube

Encoding Colors Digitally

How can we encode an RGB color triplet in computer memory? First we have to define what data type we want to use for the color components. We could use floating-point numbers and specify the valid range as being between 0.0 and 1.0. This would give us quite some resolution for each component and make a lot of different colors available to us. Sadly, this approach uses up a lot of space (3 times 4 or 8 bytes per pixel, depending on whether we use 32-bit or 64-bit floats).

We can do better at the expense of losing a few colors, which is totally OK, as displays usually have a limited range of colors they can emit. Instead of using a float for each component, we can use an unsigned integer. Now, if we use a 32-bit integer for each component, we haven't gained anything. Instead, we use an unsigned byte for each component. The intensity for each component then ranges from 0 to 255. For 1 pixel, we thus need 3 bytes, or 24 bits. That's 2 to the power of 24 (16,777,216) different colors. I'd say that's enough for our needs.

Can we get that down even more? Yes, we can. We can pack all three components into a single 16-bit word, so each pixel needs 2 bytes of storage. Red uses 5 bits, green uses 6 bits, and blue uses the remaining 5 bits. The reason green gets 6 bits is that our eyes can distinguish more shades of green than of red and blue. All bits together make 2 to the power of 16 (65,536) different colors we can encode. Figure 3-23 shows how a color is encoded with the three encodings described previously.

Figure 3-23. Color encodings of a nice shade of pink (which will be gray in the print copy of this book, sorry)

In the case of the float encoding, we could use three 32-bit Java floats. In the 24-bit encoding case, we have a little problem: there's no 24-bit integer type in Java, so we could either store each component in a single byte or use a 32-bit integer with the upper 8 bits being unused. In the case of the 16-bit encoding, we can again either use two separate bytes or store the components in a single short value. Note that Java does not have unsigned types. Thanks to the power of two's complement, we can safely use signed integer types to store unsigned values, though.
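As a small illustration, here's how we could pack an RGB888 color into a Java int and an RGB565 color into a short, and unpack the latter again. The bit layout assumes the RGB ordering discussed in the next paragraph, and the helper variables are our own:

int red = 200, green = 100, blue = 50;   // arbitrary example intensities in the range 0-255

// RGB888: 8 bits per component, red in the upper bits
int rgb888 = (red << 16) | (green << 8) | blue;

// RGB565: 5 bits red, 6 bits green, 5 bits blue; we drop the lowest bits of each component
short rgb565 = (short) (((red >> 3) << 11) | ((green >> 2) << 5) | (blue >> 3));

// unpacking the RGB565 value again
int r = (rgb565 >> 11) & 0x1f;   // 0-31
int g = (rgb565 >> 5) & 0x3f;    // 0-63
int b = rgb565 & 0x1f;           // 0-31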

For both 16- and 24-bit integer encodings, we need to also specify the order in which we store the three components in the short or integer value. There are usually two ways that are used: RGB and BGR. Figure 3-23 uses RGB encoding. The blue component is in the lowest 5 or 8 bits, the green component uses up the next 6 or 8 bits, and the red component uses the upper 5 or 8 bits. BGR encoding just reverses the order. The green bits stay where they are, and the red and blue bits swap places. We'll use the RGB order throughout this book, as Android's graphics APIs work with that order as well. Let's summarize the color encodings discussed so far:

■ A 32-bit float RGB encoding has 12 bytes for each pixel, and intensities that vary between 0.0 and 1.0.

■ A 24-bit integer RGB encoding has 3 or 4 bytes for each pixel, and intensities that vary between 0 and 255. The order of the components can be RGB or BGR. This is also known as RGB888 or BGR888 in some circles, where 8 specifies the number of bits per component.

■ A 16-bit integer RGB encoding has 2 bytes for each pixel; red and blue have intensities between 0 and 31, and green has intensities between 0 and 63. The order of the components can be RGB or BGR. This is also known as RGB565 or BGR565 in some circles, where 5 and 6 specify the number of bits of the respective component.

The type of encoding we use is also called the color depth. Images we create and store on disk or in memory have a defined color depth, and so do the framebuffer of the actual graphics hardware and the display itself. Today's displays usually have a default color depth of 24 bits, and can be configured to use less in some cases. The framebuffer of the graphics hardware is also rather flexible, and can use many different color depths. Our own images can of course also have any color depth we like.

NOTE: There are a lot more ways to encode per-pixel color information. Apart from RGB colors, we could also have grayscale pixels, which only have a single component. As those are not used a lot, we'll ignore them at this point.

Image Formats and Compression

At some point in our game development process, our artist will provide us with images she created with some graphics software like Gimp, Paint.NET, or Photoshop. These images can be stored in a variety of formats on disk. Why is there a need for these formats in the first place? Can't we just store the raster as a blob of bytes on disk?

Well, we could, but let's check how much memory that would take up. Say we want the best quality, so we choose to encode our pixels in RGB888, at 24 bits per pixel. The image would be 1,024 x 1,024 in size. That's 3 MB for that single puny image alone! Using RGB565, we can get that down to roughly 2 MB.
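The same calculation in code form, in case you want to check other image sizes or color depths:

int width = 1024, height = 1024;
int rgb888Bytes = width * height * 3;   // 3,145,728 bytes, about 3 MB
int rgb565Bytes = width * height * 2;   // 2,097,152 bytes, roughly 2 MB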

As in the case of audio, there's been a lot of research on how to reduce the memory needed to store an image. As usual, compression algorithms are employed, specifically tailored to the needs of storing images and keeping as much of the original color information as possible. The two most popular formats are JPEG and PNG. JPEG is a lossy format. This means that some of the original information is thrown away in the process of compression. PNG is a lossless format, and will reproduce an image that's 100 percent true to the original. Lossy formats usually exhibit better compression characteristics and take up less space on disk. We can therefore choose which format to use depending on our disk space constraints.

Similar to sound effects, we have to fully decompress an image when we load it into memory. So, even if your image is 20 KB compressed on disk, you still need the full width times height times color depth storage space in RAM.

Once loaded and decompressed, the image will be available in the form of an array of pixel colors, in exactly the same way the framebuffer is laid out in VRAM. The only difference is that the pixels are located in normal RAM and that the color depth might differ from the framebuffer's color depth. A loaded image also has a coordinate system like the framebuffer, with the origin being in its top-left corner, the x-axis pointing to the right, and the y-axis pointing downward.

Once an image is loaded, we can draw it to the framebuffer by simply transferring the pixel colors from the image to the appropriate locations in the framebuffer. We don't do this by hand; instead we use an API that provides that functionality.
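Under the hood, such a drawing call boils down to something like the following sketch, which copies an image's pixels to the position (destX, destY) in the framebuffer using the address formula from earlier. All names are made up, and clipping against the framebuffer's bounds is omitted to keep things short:

// copies an imageWidth x imageHeight image to the framebuffer at (destX, destY)
void drawImage(int[] framebuffer, int framebufferWidth,
               int[] image, int imageWidth, int imageHeight,
               int destX, int destY) {
    for (int y = 0; y < imageHeight; y++) {
        for (int x = 0; x < imageWidth; x++) {
            framebuffer[(destX + x) + (destY + y) * framebufferWidth] = image[x + y * imageWidth];
        }
    }
}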

Alpha Compositing and Blending

Before we can start designing our graphics module interfaces, we have to tackle one more thing: image compositing. For the sake of this discussion, assume that we have a framebuffer we can render to, as well as a bunch of images loaded into RAM that we'll throw at the framebuffer. Figure 3-24 shows a simple background image, as well as Bob, a zombie-slaying ladies man.
