Fog Creek Software
Discussion Board

Help Optimize this code (II)

I optimized the code as suggested in the previous thread by drawing the bitmap via lines and after running it through the profiler it is extremely fast.  In fact in my tests it is faster than the suggested transparent BitBlt.  So here is the code (hope you can read it):

void DrawMonochromeBitmap(HDC hdc, int x, int y, int width, int height, PBYTE pBitmap)
  int yOffset, dy = 0, RowNo = 0, BytesPerRow = ((width + 31) >> 5) << 2;
  int BitMask[8] = { 128, 64, 32, 16, 8, 4, 2, 1 };
  do {
    dx = 0;
    yOffset = y + dy;
    do {
        if (pBitmap[RowNo + (dx >> 3)] & BitMask[dx & 7])
            MoveToEx(hdc, x + dx, yOffset, NULL);
            do {
            } while ((pBitmap[RowNo + (dx >> 3)] & BitMask[dx & 7]) && dx < width);
            LineTo(hdc, x + dx, yOffset);
    } while (dx < width);
    RowNo = ++dy * BytesPerRow;
  } while (dy < height);

So my question is are there any more optimizations that can be done?  I thought maybe I could get rid of the Bitmask and the bit shifting but in my eyes that seems almost impossible.

Wednesday, July 14, 2004

Use a profiler: if 90%+ is spent inside MoveToEx and LineTo then there's no point in micro-optimising.

If height > width then draw lines vertically instead of horizontally.

If pBitmap[RowNo + ...] == 0 or == 0xFF this is a special case (no need to test each bit).

Use a lookup table to see whether pBitmap[RowNo + ...] is another common special case: all 0 bits followed by all 1 bits, or all 1 bits followed by all 0 bits.

Christopher Wells
Wednesday, July 14, 2004

Here is one...for when all bits are set in the byte
do {
              if ( pBitmap[RowNo + (dx >> 3)] == 0xFF && dx < width-8) { dx += 8 ; continue ; }


Code Monkey
Wednesday, July 14, 2004

Thanks for the comments guys.  It'll take me some time to think about how to efficiently incorporate those into the code.

Wednesday, July 14, 2004

"If height > width then draw lines vertically instead of horizontally."

Often, it is a safe assumption that the image 'scan lines' are horizontal.  Vertical access of pixels is almost always much less efficient, and trashes any average processor cache.

i like i
Thursday, July 15, 2004

Again, my guess is that each of the two Win32 API subroutine calls is so expensive (does so much work) that it flushes the processor cache anyway ... my intent in trying it vertically is only to minimize the number of times these subroutines are called.

Christopher Wells
Thursday, July 15, 2004

It looks like your current algorithm is maxed out so-to-speak.  The optimizations presented here are good but will require you to rework your current algorithm.

What I would do next is to rearrange the data in a format that will allow a more efficient algorithm to be used.  Run Length Encode the bitmap and then write a routine, still using MoveToEx and LineTo, to render it.  The format of the RLE would be something like (you would have to test it):

Height in Rows of Bitmap, Run count for this row, Relative start of run from end of previous run,  length of the run,... (repeat with Runs for the next row)... etc...

The point is that sometimes the format of the data is the limiting factor in the (speed of the) algorithm used so restating the data in a more efficient format can allow for a more efficient algorithm to be used.

For example, you just have to increment a pointer with the new data format and you will automatically know when you are on a new row and where you have to draw the lines.  You no longer have to decode the bits of each byte.  The trade off is obviously space.  The size of the bitmap will determine the data type used.  For small bitmaps you could get away with using a char and might even save space.

Dave B.
Thursday, July 15, 2004

Geez people, trying to optimize GDI is an oxymoron! That's why they made DirectX, for when speed is needed...

Game Programmer Lou
Thursday, July 15, 2004

I'm not trying to optimize GDI.  I don't believe GDI is very optimized to begin with.  If it was it would BE DirectX.

The fact is that TextOut is slow and I'm thinking I can create dynamic text by caching all character bitmaps in the ASCII character set of a specified font using the GetGlyphOutline function.  Then rendering the bitmaps.  So far it works and it is faster (in my tests at least) than BitBlt with all those raster ops and it is faster than TextOut.

I'm just a hobby coder anyway so it's an interesting exercise.  As for using DirectX.  I don't want to.  I would have to use DirectX 7.0 for direct access to the screen buffer.  I have no idea why MS would even think about removing such a vital capability.  (Unless DX 9 brought it back.)

Thursday, July 15, 2004

Nothing was removed in DirectX, otherwise a generation of programs would no longer work. Just query for IDirectDraw7 and call its CreateSurface or whatever you want.

But what do you mean by "rendering" bitmaps in GDI -- without BitBlt? Drawing each pixel with SetPixel or one line at a time with GDI lines is in NO WAY faster than a BitBlt of the entire thing. If you think it is, you need to really examine your test code....

Game Programmer Lou
Thursday, July 15, 2004

I agree with Lou... this exercise is singularly pointless. If you want fast graphics, use a fast graphics API. Even if you do manage to achieve a small speed improvement by this method it would still be a horrificly ugly and perverse hack. Using an API to plot pixels or lines of an image one at a time being faster than using that same API to plot the whole image, just sounds wrong... if this is the case then evidently the library needs to be much better optimised for plotting monochrome images, and this optimisation would be much better done in the library code itself than at the level you're trying to do it...

Thursday, July 15, 2004

Lou is right. I find it hard to believe that any of these approaches would be faster than a good old BitBlt. There are however. many things that can greatly slow down the speed of a BitBlt:

1. If the formats of the two bitmaps are different it can be slow. For example, blitting from a bitmap of one color depth to another of a different color depth. For maximum performance, the source bitmap should match the destination bitmap (size doesn't matter - there's a joke in there somewhere).

2. The speed can depend on the raster-op chosen.

3. The device driver and hardware for the output device can have a big influence on #1, #2, and the general performance of the BitBlt. When bltting to a device, the driver does much of the work. You will find greatly different performance when running the same code on different hardware or even the same hardware with different color depths so the apparent speedups you get on one display may actually be slower than the original code on a different display.

The OP never showed how he did the original BitBlt that was supposededly slow so we couldn't comment on anything that might have affected the performance of that code.

A Graphics Programmer
Thursday, July 15, 2004

If drawing monochrome bitmaps is so slow, why not preconvert them to a format for which blitting is fast?

Do it one time on startup. Time and choose the best one. Then dump GDI and use DirectDraw.

(You will have to write a separate routine for each bit depth you support, which is tiresome... but there you go. You will probably have better results by converting onto the DirectDraw surface from a canonical format in system memory, particularly if you are not going to write every byte in the bitmap, *definitely* if you are going to read back from the bitmap, and maybe if you are likely to incur a lot of cache misses while generating the bitmap. Only an issue in windowed mode, full screen you can choose what you want (I recommend 8bpp palettized or 15/16bpp on account of their wide availability and reduced memory overhead).)

Thursday, July 15, 2004

This is the BitBlt version of the code.  Any constructive comments on it would be appreciated.

void DrawMonochromeBitmap(HDC hdc, int x, int y, int width, int height, PBYTE pBitmap, COLORREF color)
HBITMAP hBmpMask = NULL, hBmpImage = NULL, hBmpTemp = NULL;
HDC hdcMask = NULL, hdcImage = NULL, hdcTemp = NULL;
RECT rc;

bm.bmBits = pBitmap;
bm.bmBitsPixel = 1;
bm.bmHeight = height;
bm.bmPlanes = 1;
bm.bmType = 0;
bm.bmWidth = width;
bm.bmWidthBytes = ((width + 31) >> 5) << 2;

// create a bitmap from the glyph bitmap bits
hBmpMask = CreateBitmapIndirect(&bm);

// create the glyph bitmask
hdcMask = CreateCompatibleDC(hdc);
SelectObject(hdcMask, hBmpMask);    

// create a bitmap with the foreground color or texture
hdcImage = CreateCompatibleDC(hdc);
hBmpImage = CreateCompatibleBitmap(hdc, width, height);
SelectObject(hdcImage, hBmpImage);
// fill bitmap with desired color
// * - instead of a solid color a bitmap
// could be blitted to hdcImage to create
// a texture mapped glyph - *
hBrush = CreateSolidBrush(color);
SelectObject(hdcImage, hBrush);    
FillRect(hdcImage, &rc, hBrush);

// create a glyph image bitmap
hdcTemp = CreateCompatibleDC(hdc);
hBmpTemp = CreateCompatibleBitmap(hdc, width, height);
SelectObject(hdcTemp, hBmpTemp);

// invert mask
InvertRect(hdcMask, &rc);
BitBlt(hdcTemp, 0, 0, width, height, hdcMask, 0, 0, SRCAND);
BitBlt(hdcTemp, 0, 0, width, height, hdcImage, 0, 0, SRCPAINT);

// mask the dc being drawn to
BitBlt(hdc, x, y, width, height, hdcMask, 0, 0, SRCAND);
// back to original mask
InvertRect(hdcMask, &rc);
BitBlt(hdcTemp, 0, 0, width, height, hdcMask, 0, 0, SRCAND);    
BitBlt(hdc, x, y, width, height, hdcTemp, 0, 0, SRCPAINT);


See next post for RLE code.

Thursday, July 15, 2004

This is the RLE version of the code.  It beats the original by a factor of at least 2.  I haven't compared it to the BitBlt but it feels faster when put to use.

// render a run length encoded glyph bitmap
void DrawRLEMonochromeBitmap(HDC hdc, int x, int y, RLE_TYPE_POINTER pBitmap)
  int dx, height, RunCount;
  RLE_TYPE_POINTER pTemp = pBitmap;
  height = (*pTemp++);    
  do {
    dx = x;
    RunCount = (*pTemp++);
    while (RunCount--)
        dx += (*pTemp++);    
        MoveToEx(hdc, dx, y, NULL);
        LineTo(hdc, dx + (*pTemp++), y);    }
  } while (--height);

Thursday, July 15, 2004

Ops... deleted a SetRect(&rc,0,0,widht,height) from the BitBlt version... but you still get the idea about how it works.

Thursday, July 15, 2004

I would think that your BitBlt routine would be faster than any of the rest.  Basically you need to be using DirectX as suggested earlier.  As for RLE encoding... well that was good in the DOS days but now-a-days you'd be better off just going with the hardware blitter.

Thursday, July 15, 2004

Not much to add, but make this static and const if you're not messing with it:

static const int const BitMask[8] = { 128, 64, 32, 16, 8, 4, 2, 1 };

Friday, July 16, 2004

There's lots of micro-optimizations you could do, like get rid of this multiply

RowNo = ++dy * BytesPerRow;

becomes (RowNo needs to start form the top, 0)

RowNo += BytesPerRow

But the real problem is why BitBlt isn't fast enough.

Nearly always the quickest running way to do any bitmap operations, in GDI, is to hand-code the bitmap operations as manipulating a chunk of memory, use a profiler to speed up the slow bits, and BitBlt the result to the screen in one go.

Windows API calls, I'm sure DirectX calls too, are comparatively massively expensive - as compared to C code poking around in memory. So do as few as possible.

S. Tanna
Saturday, July 17, 2004

*  Recent Topics

*  Fog Creek Home