- Final Fantasy VIII engine

[eng] Secrets of LZS native unpacking :: by Demitsuri ::

0.Intro
Exploring the disassembly I finded an interesing function at address 0x0040f852 (I researched FF8 PC patch v1.2 executable). It gets two buffer pointers and copies data from one to another. But size of these buffers differs in general. After brief disassembly reading I understood that 0x0040f852 is address of unLZS function. Ficedula's LZS description is accurate, but it's mostly a view from algorithm side. And we'll look on it from assembly side.

1. Developer's footprints
FF8 code is fullfilled with those things. The whole brunches of code never run. This function also contains "hello" at address 0x0040f883:
.text:0040F883 loc_40F883:                             ;jmp reference
.text:0040F883                 mov     ecx, 1          ;ecx = 1
.text:0040F888                 test    ecx, ecx        ;ecx == 0 ?
.text:0040F88A                 jz      loc_40F9D1      ;if true, jump to exit
Nice thing, isn't it?

2. Virtual circular buffer
LZS alorithm uses 4k circular buffer for packing/unpacking. It was described by Ficedula. I wouldn't repeat well-known things. If you don't know what we're speak about, just read that article. Maybe PSX platform implements circular buffers, I don't know, but on PC usage additional buffer is a great speed loss. FF8 developers solved this problem by buffer emulating. You need buffer when you're meet reference bit (0) in control byte and why you can't use your output buffer as circular when you need it? The main problem in this case is calculating right referenced data pointer.
;ecx == refLen (length of referenced sequence stored in lower half-byte of reference word)
.text:0040F942                 mov     edx, [ebp+pDestBuffPos]
.text:0040F945                 lea     eax, [edx+ecx+3]
.text:0040F949                 mov     [ebp+pDestBuffRefEnd], eax
.text:0040F94C                 mov     ecx, [ebp+pDestBuffPos]
.text:0040F94F                 sub     ecx, [ebp+pDestBuff] ; after this ecx == current decoded data size
.text:0040F952                 mov     edx, [ebp+refData]
.text:0040F955                 sub     edx, 0FEEh ;   0x0fee 
                                                  ; + 0x0012 (max referenced data size)
                                                  ; = 0x1000 (circular buffer size)
.text:0040F95B                 sub     ecx, edx
.text:0040F95D                 and     ecx, 0FFFh
.text:0040F963                 mov     eax, [ebp+pDestBuffPos]
.text:0040F966                 sub     eax, ecx
.text:0040F968                 mov     [ebp+pRefData], eax
Here's all like listed in LZS description. But much effective way to implement circular buffer is implementing of "floating window".
mov     edx, [ebp+refData]
add     edx, 12h
mov     eax, [ebp+pDestBuffPos]
sub     eax, edx
mov     [ebp+pRefData], eax
Following native FF8 code successfully process all error cases (it will be listed in next section).

3. Negative offsets in virtual circular buffer
LZS algorithm that looks pretty on circular buffer is full with hacks on plain buffer. When you use virtual circular buffer you should remember about negative offsets.
.text:0040F96B loc_40F96B:                             ; jmp reference
.text:0040F96B                 mov     ecx, [ebp+pRefData]
.text:0040F96E                 cmp     ecx, [ebp+pDestBuff]
.text:0040F971                 jnb     short loc_40F995
.text:0040F973                 mov     edx, [ebp+pDestBuffPos]
.text:0040F976                 cmp     edx, [ebp+pDestBuffRefEnd]
.text:0040F979                 jnb     short loc_40F995
.text:0040F97B                 mov     eax, [ebp+pDestBuffPos]
.text:0040F97E                 mov     byte ptr [eax], 0
.text:0040F981                 mov     ecx, [ebp+pDestBuffPos]
.text:0040F984                 add     ecx, 1
.text:0040F987                 mov     [ebp+pDestBuffPos], ecx
.text:0040F98A                 mov     edx, [ebp+pRefData]
.text:0040F98D                 add     edx, 1
.text:0040F990                 mov     [ebp+pRefData], edx
.text:0040F993                 jmp     short loc_40F96B
I suggest, FF8 was developed on C and compiled without code optimization flags or maybe 10 years ago compilers were weak in optimization. But this code looks very strange. We know all about our buffers. Why do we need to check bounds on every step? Why we need to copy zeroes byte-by-byte? Programmer could call memset function or implement own
mov     ecx, [ebp+pRefData]
cmp     ecx, [ebp+pDestBuff]
jnb     short loc_40F995
push    edi
mov     edi, [ebp+pDestBuffPos]
mov     ecx, [ebp+pDestBuff]
sub     ecx, [ebp+pRefData] ; we already checked bounds, so, pDestBuff > pRefData
mov     eax, [ebp+pDestBuffPos]
add     eax, ecx
mov     [ebp+pDestBuffPos], eax
xor     eax, eax
rep     stosb
pop     edi
This looks a little better. We use (9 + zero_data_size) memory i/o operations instead of (10 * zero_data_size). It's faster when zero_data_size > 1.

4. Outro
I don't want to say that FF8 is a nasty-coded game. But there are many tricking and surgery bypassing code, there are a lot of temporary buffers and copying operations from one to other. All these things make reverse engineering hard but interesting. I hope this article was interesting for you too.