Saturday, August 30, 2014

Fixing Textadept's console window flashing issue (with API hooking in Free Pascal)

Textadept is a lovely little text editor I have been using alongside Emacs for some time. There is one small annoyance about it: on Windows, a black console window pops up then disappears every time it runs a sub-process to perform some task.

The author is well aware of this issue. But according to him, "there is  no way around this" (  He obviously knows how to create sub-process on Windows using CreateProcess but the problem is that the rest of the application code all use FILE* for file I/O. FILE is an opaque type, there is no easy way to convert a Win32 pipe HANDLE to a FILE*.

Today, I decided to give this a shot. Firstly, I need to find out why is the console window popping up at all. I know whatever Textadept uses to create sub-process (glib or lua?), it must eventually call the Win32 CreateProcess function to do the job. So let's fire up our debugger (I used x64_dbg and set a breakpoint at kernel32!CreateProcessA to see what went wrong.

When the debugger breaks in, the stack looks like this:

The first item is the return address, then the 10 arguments for the CreateProcess function. The most interesting one is the 9th argument 'lpStartupInfo' so let's check out what is inside:

In order to hide a sub-process's window, the parent process needs to set the flag STARTF_USESHOWWINDOW in dwFlags, and set the show window command in wShowWindow to SW_HIDE. Looking at the screenshot, the caller obviously didn't set them up correctly (dwFlags and wShowWindow are the 2 words at the end of the 3rd row and the beginning of the 4th row). This is why we are seeing the black console window popping up.

So how do we fix this? I really don't feel like to download the source code and all its dependencies then spend half a day trying to figure out how to build the application from source, especially when its author has explicitly declared "Compiling Textadept on Windows is no longer supported. The preferred way to compile for Windows is cross-compiling from Linux".

But wait, isn't Textadept mostly written in Lua? Lua is well known for being extremely easy to extend. How about a Lua dll extension that dynamically patches the CreateProcess call and supply it with correct arguments?

It didn't take me much time to decided I want to write this extension dll in Free Pascal. Why? because Free Pascal ships with Lua units out of the box.

Putting together a minimal extension in Pascal is just a matter of minutes:

library fixpopen;
{$mode objfpc}{$H+}

  Classes, lua, lauxlib, lualib, windows;

function patch(L:Plua_State):integer;cdecl;

function unpatch(L:Plua_State):integer;cdecl;

function libinit(L:Plua_State):integer;cdecl;export;
  lua_register(L, 'fix_popen_patch', @patch);
  lua_register(L, 'fix_popen_unpatch', @unpatch);



Compile this code to a DLL, put it in the same directory as Textadept, then add these lines to ~/.textadept/init.lua

mylibinit = package.loadlib("fixpopen.dll", "libinit")

Open DbgView and run Textadept. The tracing messages appeared in DbgView as you'd' expected.

Now it's time to add the API hooking. The term "API hooking" sounds intimidating but the principle is really simple. We put a small piece of code at the entry point of the function we want to hook. This piece of code, we call it a 'thunk', will catapult us to our own function, there, we need to restore the first few bytes of the original function, call the orignal function, then put the thunk back so that we can intercept the next API call.

This is the definition of the thunk type:

  TThunk = record
    jmp   : byte;
    offset: longword;

It is just one assembly instruction. The first byte is the op code, which should be $e9, or JMP. Followed by a 4-byte offset.

We need 2 of these thunks, one to store our JMP instruction, another one to store the original content which will be overwritten by our thunk so that we can restore it later.

Some initialization is needed when the DLL is loaded into the Textadept process:
  thunk : TThunk = (jmp:$e9; offset:$0);
  save  : TThunk = (jmp:$0;  offset:$0);
  w32CreateProcess : TCreateProcess = nil;

  if w32CreateProcess = nil then
     // save the API call address
     w32CreateProcess := TCreateProcess(GetProcAddress(GetModuleHandle('kernel32.dll'), 'CreateProcessA'));
     // save API function prelude
     CopyMemory(@save, w32CreateProcess, sizeof(TThunk));
     // fill in the thunk
     thunk.offset:= pointer(@myCreateProcess) - pointer(w32CreateProcess) - 5;

Here we save the address of the CreateProcess function, and its first 5 bytes in 'save'. We also fill in the thunk with the jump offset, which is the relative distance between the next instruction after the JMP instruction (that's where the -5 comes from) to the target address.

Now the fun part, patching the Windows API code.

function patch(L:Plua_State):integer;cdecl;
  bret: BOOL;
  bret:=VirtualProtect(w32CreateProcess, sizeof(TThunk), PAGE_EXECUTE_READWRITE, @protect);
  CopyMemory(w32CreateProcess, @thunk, sizeof(TThunk));
  VirtualProtect(w32CreateProcess, sizeof(TThunk), protect, nil);

Because the code segment is write protected by default, we first need to turn off the protection. Then we just copy the thunk over CreateProcess's code. We already have a saved copy of that memory so we can easily restore that when we need the original code back.

The unpatch function is almost identical, except it copies the saved stuff back.

  bret: BOOL;
  VirtualProtect(w32CreateProcess, sizeof(TThunk), PAGE_EXECUTE_READWRITE, @protect);
  CopyMemory(w32CreateProcess, @save, sizeof(TThunk));
  VirtualProtect(w32CreateProcess, sizeof(TThunk), protect, nil);

Finally, the redirected CreateProcess function:

function myCreateProcess(...
  lpStartupInfo^.dwFlags:=lpStartupInfo^.dwFlags or STARTF_USESHOWWINDOW;
  result := w32CreateProcess(lpApplicationName,lpCommandLine,lpProcessAttributes,

It has exactly the same signature as CreateProcessA. Once we get in here, we first fix the original function so that it can be called. Then we modify the StartupInfo argument to hide the console window. Lastly, call the windows CreateProcessA function and put the thunk back.

And this perfectly solves the black console window flashing issue in Textadept.

The complete source code can be found here

Saturday, June 28, 2014

Dissecting the 128-byte raycaster

Lately a Reddit post caught my attention which is about a raycaster demo, animated, with mouse control and texture mapping,  in only 128 bytes. It looks like this:

And this is it in motion:

Impressive, isn't it? Out of curiosity, I took a look at the source code  ( and spent some time trying to figure out how it works. So now I'd like to share my findings here.

The algorithm

For each pixel, wolf128.asm performs two ray marching, one on X-Z plane, and one on Y-Z plane, to find out the distance to the nearest wall/ceiling/floor.  The ray marching happens on 2D planes and are always axis aligned so there is no vector math needed.

For simplicity, below I'll use GLSL code to show how it works.  Let's consider only the vertical direction first:
out vec4 outputColor;
uniform vec2 iResolution;
void main(void)
    // normalize y coordinate to 0-200, in consistent with wolf128's
    // vga 320x200 mode
    vec2 uv = gl_FragCoord.xy / iResolution.xy;
    uint y = uint(uv.y * 200.0);
    // initial depth is 0
    uint z = 0u;
    // the distance to wall(ceiling/floor)
    uint ydist;
    // we'll trace a maximun of 256 steps
    for(int i = 0; i < 256; ++i)
        // the distance to wall is propotional to depth,
        // and propotional to the offset of the pixel from
        // the center of the screen(which is 100, because
        // the height of the screen is 200). In order to be
        // consistent with wolf128, everything uses unsigned
        // numbers, and we keep 8-bit precision for the result.
        ydist = ((y - 100u) * z + 4096u) >> 8;
        // If the distance exceeds a predefined threshold
        if (ydist >= 32u)
        // next itration, increase depth by 1

    // normalize the depth value to [0,1]
    float c = float(z) / 256.0;
    outputColor = vec4(vec3(c), 1.0);
The output of this fragment shader is like this:

Fairly simple isn't it? The only thing worth mention is the +4096 in our distance calculation. The reason we need this is because the nature of unsigned numbers.  Imagine when the Y coordinate is less then 100, for example 99,  now 99-100 will underflow and becomes a very large number, immediately exceeds the threshold.  The answer is to add 4096 to the calculation.  Why 4096? Because we shift result right by 8 bits and  4096 >> 8 = 16, which is in the middle of our threshold (32).  This effectively shifts all results to the positive direction by 16, so <100 coordinates will get results starting from 16 and gradually decrease from there towards 0 until it underflows and  trigger the >=32 condition.  Meanwhile results from positive coordinates will start from 16 and adds up from there until reach 32.

Now let's add horizontal walls:
out vec4 outputColor;
uniform vec2 iResolution;
void main(void)
    vec2 uv = gl_FragCoord.xy / iResolution.xy;
    uint x = uint(uv.x * 320.0);
    uint y = uint(uv.y * 200.0);   
    uint z = 0u;
    uint xdist, ydist;
    for(int i = 0; i < 256; ++i)
        // horizontal center is 160
        xdist = ((x - 160u) * z + 4096u) >> 8;
        ydist = ((y - 100u) * z + 4096u) >> 8;
        // break the loop if we hit a wall in either direction
        if ((xdist >=32u) || (ydist >= 32u))
    float c = float(z) / 256.0;
    outputColor = vec4(vec3(c), 1.0);
Almost identical, only added distance tracing for horizontal walls. And the result is no surprising:

The above two shaders merely renders the depth value of a virtual tunnel (or distance field, if you prefer to use that term).  Now let's add some textures.

Given the fact that for each pixel we already have the distance to horizontal and vertical walls and we have the depth value, that's enough information on all 3 dimensions.  A very simple but effective way to generate some texture is to XOR all of them together, like this:
   uint texel = xdist ^ ydist ^ z;
    texel %= 16u; // adjust the period of the pattern to your liking
    float c = float(texel) / 16.0;
    outputColor = vec4(vec3(c), 1.0);

The result is this:

Now we are talking!

The next step is to make some corridors in  this tunnel.  We can do this with the depth information which we already have. The only change we need to make is change the loop exit condition. When we hit a horizontal wall, do not exit immediately, instead, check whether the depth satisfies some sort of conditions, for example turn wall on and off for every other 32 units:
   if ((xdist >=32u && ((z & 32u) != 0u)) || (ydist >= 32u))
This is what we get from this change:
Last thing, right now the maze looks quite boring, let's add some variation to it.  The way we do it is to add or subtract some random values before checking the distance:
       switch(z / 64u)
        case 0u: xdist -= 10u; break;
        case 1u: xdist -= 40u; break;
        case 2u: xdist -= 20u; break;
        case 3u: xdist -= 30u; break;
Now it looks almost exactly the same as the original wolf128.asm program:

The full fragment shader code that produces the above output is this:
out vec4 outputColor;
uniform vec2 iResolution;
void main(void)
   vec2 uv = gl_FragCoord.xy / iResolution.xy;
    uint x = uint(uv.x * 320.0);
    uint y = uint(uv.y * 200.0);
    uint z = 0u;
    uint xdist, ydist;
    for(int i = 0; i < 256; ++i)
        xdist = ((x - 160u) * z + 4096u) >> 8;
        ydist = ((y - 100u) * z + 4096u) >> 8;
        switch(z / 64u)
        case 0u: xdist -= 10u; break;
        case 1u: xdist -= 40u; break;
        case 2u: xdist -= 20u; break;
        case 3u: xdist -= 30u; break;
        if ((xdist >=32u && ((z & 32u) != 0u)) || (ydist >= 32u))

    uint texel = (xdist & 255u) ^ (ydist & 255u) ^ z;
    texel %= 16u;
    float c = float(texel) / 16.0;
    outputColor = vec4(vec3(c), 1.0);
 That's all about the algorithm. Next to follow is

Assembly code analysis

In order to minimize the size, wolf128.asm is a DOS .COM file. The advantage of .COM file is that it has zero fat. Every single byte in the file is program code. When executed, DOS(or Windows) loads the code to offset 0x100 of the current code segment, and fill the 256 bytes before it with the so called Program Segment Prefix (or PSP).

Below I'll explain the code line by line.

  ; Call BIOS INT 10h set the display mode to VGA mode 13h,
  ; which is 320x200,256 color. It takes advantage of ah's
  ; initial value being 0, therefore save one instruction to
  ; load ah.
  mov al,13h
  int 10h
  ; bx has initial value 0, and ds is the same as cs,so ds:bx
  ; points to PSP.
  ; es <- 9FFF (PSP+2 is the end of DOS program area,which is 9FFF)
  ; dx is also assigned,but not used。Why load es with 9fff? That's
  ; because the VGA frame buffers starts at A000
  les dx,[bx]
  ; cl <- 255, max loop count
A:mov cl,0ffh
  ; bl <- cl,bl keeps track of depth
B:mov bl,cl
  ; this is equivalent to bl = 255-cl
  not bl
  ; ax <- 0xffee (di has initial value 0xfffe,minus 10h and get 0xffee)
  ; This will be the screen buffer size. Why 0xffee instead of
  ; 320x200=0xfa00? This is because in order to save space, wolf128 always
  ; writes 65535 bytes for each frame. Since our es segment is 9fff, and di
  ; is fffe, after 2 bytes, di will wrap back to 0, which is 9fff:0, and 16
  ; bytes before the VGA frame buffer at a000:0. So that's a total of 18
  ; extra bytes written before the frame buffer: 65535 - 18 = 0xffee
  lea ax,[di-10h]
  ; bp <- 320,screen width
  mov bp,140h
  ; dx <- 0000,cdq is in fact a 32-bit instruction, it sign extends eax to
  ; edx. here it takes advantage of the fact that the high word in eax is
  ; always 0 in 16-bit programs.
  ; DX:AX(FFEE) divide by BP(320),get quotient(Y coordinate) in AX
  ; remainder(X coordinate) in DX
  div bp
  ; calling G twice is equvalent to
  ; ax = ((ax - 100) * z + 0x13b0);
  ; dx = ((dx - 100) * z + 0x13b0);
  ; You should already be familiar with this formular. But Why 0x13b0
  ; instead of 0x1000(4096)? And why both are minus 100? Those are all
  ; for reducing program size. 0x13b0 is the first 2 bytes of the
  ; program, which can be loaded by taking advantage of SI's initial
  ; value being 0x100. And if the 2 formulars are written separately
  ; it'll take more instructions. The results are kept in 16-bit
  ; registers AX and DX but later we only use AH and DH, this is like
  ; an 8-bit right shift for free.
  call G
  call G
  ; the program keeps mouse button state at ds:1dh
  test [1dh],7
  ; if no button is pressed then no need to look at mouse position
  jz C
  ; ds:1fh keeps mouse X coordinate. Add to x distance if button is pressed
  add dh,[1fh]
  ; ds:1eh is mouse Y coordinate, subtract from depth
  sub bl,[1eh]
  jmp short E
  ; fs has initial value 0, under DOS address 0x46c stores the current
  ; tick count. Put the tick in DL.
C:mov dl,[fs:46ch]
  ; Subtract an arbitary constant(0xb0,the first byte of program)
  ; It's picked to avoid collision with the walls too frequently
  sub bl,[si]
  ; move to the right every 64 ticks
  test dl,40h
  jnz D
  add dh,dl 
  ; move forward
D:add bl,dl
  ; The shld instruction here effectively shifts depth right by 6 bits
  ; and put it in BP
E:shld bp,bx,10
  ; x distance minus an arbitary offset parameterized by depth>>6 to add
  ; variaty to the scene.
  ; The data used here is the first 4 bytes of PSP, but can be anything.
  sub dh,[bp]
  ; Save x distance in al
  mov al,dh
  ; What these 3 instructions do are very similar to
  ; (xdist >=32u && ((z & 32u) != 0u)) || (ydist >= 32u) 
  and dh,bl
  or dh,ah
  and dh,20h
  ; If condition is not met then --cl and keep marching
  loopz B

  ; XOR al,bl,ah(x,y,z) together to create texture
  xor al,bl
  xor al,ah
  ; This undocumented instructino is a variant of AAM. It splits the
  ; high and low 4 bits of AL and put them in AH and Al. This effectively
  ; performs al % 16.
  db 0d4h,10h
  ; Add 16 to texure. Why 16? because the we want to map the texture
  ; value to the second row of the VGA mode 13h palette (which is
  ; different shades of grays)

  add al,10h
  ; write al to es:di (and di increase by 1)

  ; If di is not 0, it means we are not finished with a frame
  ; go back to compute the next pixel
  or di,di
  jnz A

  ; read mouse state
  ; INT 33,3 - Get Mouse Position and Button Status
  ; CX = horizontal (X) position  (0..639)
  ; DX = vertical (Y) position  (0..199)
  ; BX = button status:
  ; |F-8|7|6|5|4|3|2|1|0|  Button Status
  ; |  | | | | | | | `---- left button (1 = pressed)
  ; |  | | | | | | `----- right button (1 = pressed)
  ; `------------------- unused
  mov ax,3
  int 33h
  ; left button down?
  test bl,al
  ; Store button state at 1dh
  mov [1dh],bl
  ; If button is not down, goto F
  jz F
  ; If button is down, store Y coordinate at 1eh
  mov [1eh],dl
  ; X coordinate at 1fh
  mov [1fh],cl

  ; Read keyboard scancode from port 60h
F:in ax,60h
  ; The scancode of ESC is 1, so if scancode - 1 == 0, then quit
  dec ax
  jnz B

  ; This procedure is the formular
  ; ((x - 100) * z + 0x13b0)
G:xchg ax,dx
  sub ax,64h
  imul ax,bx
  add ax,[si]

This is it, the assembly code explained. Being assembly code it may still be rather difficult to read so below is a clone I wrote in D language. It does exactly the same thing as the assembly code, and has exactly the same code flow and is even annotated with the original assembly code, have fun!

import std.stdio;
import std.datetime;
import std.algorithm;
import std.array;
import std.random;
import derelict.sdl2.sdl;

void render(SDL_Renderer* ren, SDL_Texture* tex)

   // the VGA frame buffer starts from A000 but wolf128's ES points
   // to 9FFF so the first 16 bytes are not displayed.
   ubyte[65536] fb;            // 9fff:0000 - 9fff:ffff
   // initial values;
   ubyte[] data = [0xCD, 0x20, 0xFF, 0x9F]; // the first 4 bytes of PSP
   ubyte mouseButton = 0xff;   // ds:1dh
   ubyte mouseX  = 0xff;       // ds:1eh
   ubyte mouseY  = 0xff;       // ds:1fh
   ushort idx = 0xfffe;        // di = 0xfffe on start
   // registers
   ubyte  stepsLeft;           // cl
   ubyte  z;                   // bl
   ushort offsetInFrame;       // ax
   ushort width;               // bp
   ushort y;                   // ax/ah
   ushort x;                   // dx/dh
   ubyte  texCoord;            // al
   ubyte  texel;               // al
   uint   temp;                // dx:ax
   uint   temp2;               // bp
   // A:mov cl,0ffh
   stepsLeft = 255;
   // B:mov bl,cl
   //   not bl
   z = 255 - stepsLeft;
   offsetInFrame = cast(ushort)(idx - 16); // lea ax,[di-10h]
   width = 320;                // mov bp,140h
   temp = offsetInFrame;           // cdq
   y = cast(ushort)(temp / width); // div bp
   x = cast(ushort)(temp % width);
   x = cast(ushort)((x - 100) * z + 0x13b0); // call G
   y = cast(ushort)((y - 100) * z + 0x13b0); // call G
   x >>= 8;                    // ; x in dh
   y >>= 8;                    // ; y in ah
   // test [1dh],7
   // jz C
   if (mouseButton & 7)
       x += mouseX;            // add dh,[1fh]
       z -= mouseY;            // sub bl,[1eh]
       // C:mov dl,[fs:46ch]
       ubyte ticks = cast(ubyte)(Clock.currSystemTick().msecs/60);
       z -= 0xb0;       // sub bl,[si]
       // test dl,40h
       // jnz D
       if ((ticks & 0x40) == 0)
           x += ticks;         // add dh,dl
       z += ticks;             // D:add bl,dl
   temp2 = z >> 6;             // shld bp,bx,10
   x -= data[temp2];           // sub dh,[bp]
   texCoord = cast(ubyte)x;    // mov al,dh
   x &= z;                     // and dh,bl
   x |= y;                     // or dh,ah
   x &= 0x20;                  // and dh,20h
   if (stepsLeft-- && x == 0)  // loopz B
       goto nextstep;
   texel = texCoord ^ z;         // xor al,bl
   texel = texel ^ cast(ubyte)y; // xor al,ah
   texel %= 16;                  // db 0d4h,10h
   texel += 16;                  // add al,10h; map to the 2nd row of VGA palette
   fb[idx++] = texel;            // stosb
   // or di,di
   // jnz A
   if (idx != 0)
       goto nextframe;
   // mov ax,3
   // int 33h
   SDL_Event event;
   if (SDL_PollEvent(&event))
       // test bl,al
       // mov [1dh],bl
       mouseButton = (event.type == SDL_MOUSEBUTTONDOWN) ? 2 : 0;
       if (mouseButton)
           mouseX = cast(ubyte)event.button.x; // mov [1eh],dl
           mouseY = cast(ubyte)event.button.y; // mov [1fh],cl
       // in ax,60h
       // dec ax
       // jnz B
       if (event.type == SDL_KEYUP && event.key.keysym.sym == SDLK_ESCAPE)
   SDL_UpdateTexture(tex, null, cast(const(void)*)convert(fb[16..(16+320*200)]), 320 * uint.sizeof);
   SDL_RenderCopy(ren, tex, null, null);
   goto nextframe;

// ---------------------------------------------------------------------

uint convertPixel(ubyte p)
   p = cast(ubyte)((p - 16) * 16);
   return p | (p << 8) | (p << 16) | 0xff000000;

uint[] convert(const(ubyte)[] buf)
   return map!convertPixel(buf).array;

int main(char[][] args)
   SDL_Window* win;
   SDL_Renderer* ren;
   SDL_CreateWindowAndRenderer(320, 200, 0, &win, &ren);
   SDL_Texture* tex = SDL_CreateTexture(ren, SDL_PIXELFORMAT_ARGB8888, SDL_TEXTUREACCESS_STREAMING, 320, 200);
   render(ren, tex);
   return 0;