WORK IN PROGRESS!!!
Making size-limited 1k/4k intros for OS X could be described as a hybrid of the Windows and unix worlds. On one hand, tricks like import-by-hash dynamic loading are rather straightforward, much as on Windows; on the other hand, the unix backend allows some nifty things like shell dropping.
At a high level there are certain pros and cons to working with OS X in the size-limited world.
Pros:
Cons:
Fundamentals (32-bit executables)
The OS X native executable format, Mach-O, contains a bare-minimum header and load commands, followed by the content of the segments defined. A practical executable needs a segment, a main, a dylinker and at least one dylib command.
Due to recent jailbreak exploits against the kernel, Apple has made the kernel-side format validation really strict. The Mach-O header and load commands need to have the correct sizes, and offsets need to point inside the command they were issued in. Only the commands handled by dyld can still be fudged around. In practice this means that only the library list and the SDK header can be modified, which does not optimize much but allows making the header more regular for compression purposes.
Example of a minimal executable with 180 bytes of header:
org 0
bits 32
%include "symbols.asm"
FileStart:
[section .text]
MACHHeader:
dd 0xfeedface ; magic
dd SOLVE(CPU_TYPE_X86) ; cpu type
dd 0 ; cpu subtype (wrong, but works)
dd SOLVE(MH_EXECUTE) ; filetype
dd 4 ; number of commands
dd MACHCommandsEnd-MACHCommandsStart ; size of commands
dd 0 ; flags
MACHHeaderEnd:
MACHCommandsStart:
Command1Start:
dd SOLVE(LC_MAIN) ; this is main
dd Command1End-Command1Start ; size
dd CodeStart-MACHHeader ; eip
dd 0 ; (entryoff is 64-bit)
dd 0 ; stack
dd 0
Command1End:
Command2Start:
dd SOLVE(LC_SEGMENT) ; this is segment
dd Command2End-Command2Start ; size
SegNameStart:
times 16-$+SegNameStart db 0 ; segment name
dd 0 ; vmaddr
dd 0x100000 ; vmsize
dd 0 ; fileoff
dd FileEnd ; filesize
dd 0 ; maxprot
dd SOLVE(VM_PROT_READ)|SOLVE(VM_PROT_WRITE)|SOLVE(VM_PROT_EXECUTE) ; initprot
dd 0 ; nsects
dd 0 ; flags
Command2End:
Command3Start:
dd SOLVE(LC_LOAD_DYLINKER) ; dyld
dd Command3End-Command3Start ; size
dd DyldName-Command3Start ; offset
DyldName:
db '/usr/lib/dyld',0 ; 14 bytes
; db 0,0 ; breaking the spec here, no-one minds.
Command3End:
Command4Start:
dd SOLVE(LC_LOAD_DYLIB)
dd Command4End-Command4Start ; size
dd DyName-Command4Start ; offset
dd 0 ; timestamp
dd 0 ; current version
dd 0 ; compatibility version
DyName:
db 'Cocoa.framework/Cocoa',0
Command4End:
MACHCommandsEnd:
CodeStart:
ret
CodeEnd:
times 4096-(CodeEnd-MACHHeader) db 0 ; pad the file to the 4k minimum
FileEnd:
Please see the definition of the SOLVE() macro in the Tools/Examples section.
The obvious next question is how to get hold of the libraries that have been loaded. There are two known ways to do it, assuming import-by-hash is the desired way to resolve symbols:
Once the address of a library is known (as well as its name), resolving the symbols is rather trivial: the library address points to the Mach-O header of the library, and from there the dysymtab can be sought, where the function addresses and names can be found.
The only remaining problem is where the __dyld functions can be found. Fortunately for us, dyld is loaded at a static address, 0x8fe00000 plus an ASLR fudge, which is available at the start of execution (and only at the start of execution; it is gone after that).
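As an illustration of that walk, here is a hedged C sketch. It is not the code of any actual intro: hash_name and find_by_hash are made-up helper names, the hash itself is just a placeholder, and the routine assumes a plain 32-bit image outside the dyld shared cache so that the LC_SYMTAB file offsets can be translated via the __LINKEDIT segment.
#include <stdint.h>
#include <string.h>
#include <mach-o/loader.h>
#include <mach-o/nlist.h>
// Placeholder hash; a real intro uses whatever is cheapest in asm.
static uint32_t hash_name(const char *s)
{
    uint32_t h=0;
    while (*s) h=h*33+(uint8_t)*s++;
    return h;
}
// mh = address of the loaded library's Mach-O header, slide = its ASLR slide,
// wanted = hash of the symbol name (including the leading underscore).
static void *find_by_hash(const struct mach_header *mh,uintptr_t slide,uint32_t wanted)
{
    const struct load_command *lc=(const struct load_command*)(mh+1);
    const struct symtab_command *st=0;
    uintptr_t linkedit=0; // in-memory base for the LC_SYMTAB file offsets
    for (uint32_t i=0;i<mh->ncmds;i++)
    {
        if (lc->cmd==LC_SYMTAB) st=(const struct symtab_command*)lc;
        else if (lc->cmd==LC_SEGMENT)
        {
            const struct segment_command *sc=(const struct segment_command*)lc;
            if (!strcmp(sc->segname,"__LINKEDIT")) linkedit=sc->vmaddr+slide-sc->fileoff;
        }
        lc=(const struct load_command*)((const char*)lc+lc->cmdsize);
    }
    if (!st) return 0;
    const struct nlist *sym=(const struct nlist*)(linkedit+st->symoff);
    const char *str=(const char*)(linkedit+st->stroff);
    for (uint32_t i=0;i<st->nsyms;i++)
    {
        if ((sym[i].n_type&N_TYPE)!=N_SECT) continue; // skip undefined/indirect entries
        if (hash_name(str+sym[i].n_un.n_strx)==wanted) return (void*)(sym[i].n_value+slide);
    }
    return 0;
}
In an actual intro this walk is written in hand-optimized assembly; in the TDA breakdown below the whole import-by-hash setup fits in roughly a hundred bytes.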
Compression
There have been many variants of PPM-like compression for size-coded entries, and it is still the algorithm used by many intro tools (Crinkler, elfling). For OS X we can use onekpaq, which has Crinkler-like performance. (The tool is open source and can be used on other platforms as well.)
Due to the verbose headers and the 4k minimum file size it is almost always mandatory to use shell dropping, even if the actual executable is already compressed by some better compressor. Recede/TDA was the first Mac 4k to feature double compression: onekpaq compressed the content and a shell dropper was used to minimize the header.
Most of the tricks that are valid on other unix systems are also valid on OS X. However, OS X does not have xz, the favorite compressor of BSD and Linux sizecoders, so we have to get by with gzip/bzip2 if we want to do shell dropping.
By interleaving the gzip header (6 free bytes) with the shell dropper we can have the shortest possible shell dropper, 34 bytes:
First line:
cp $0 /tmp/z;(sed 1d $0|zcat
and in the gzip header (at offset +4; the flags byte needs to be 0x10):
)>$_;$_
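To make that concrete, below is a hedged sketch of one way the snippet could be laid out inside the gzip header; the exact byte placement is an assumption, only the 6 free bytes at offset +4 and the 0x10 flag are given above.
// One possible layout, for illustration only: MTIME, XFL and OS are the 6
// bytes zcat ignores, and FLG=0x10 (FCOMMENT) makes everything after the
// fixed header be skipped until a terminating NUL, hiding the overflow.
static const unsigned char gzip_header[]={
    0x1f,0x8b,          // gzip magic
    0x08,               // CM: deflate
    0x10,               // FLG: FCOMMENT
    ')','>','$','_',    // MTIME: 4 free bytes
    ';','$',            // XFL, OS: 2 more free bytes
    '_',0x00            // comment field: tail of the snippet + NUL
    // ...the deflate stream follows here...
};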
Firehawk originally found the “sed 1d” trick, and it is now widely used. This header could be minimized further if Apple someday started to install tac or changed the default shell to zsh, both of which are very unlikely to happen.
Breakdown of the TDA/Affinity 1k intro (uncompressed sizes in parentheses), using a simple shell dropper without any second-stage compression:
Content | Size |
---|---|
Shell dropping | 44 bytes |
Mach-O header | 90 (180) bytes |
Import by hash + setup | 101 (103) bytes |
Padding to 4k | 8 (1235) bytes |
Hashes for libs/functions | 69 (65) bytes |
OpenGL + midi code | 223 (344) bytes |
Midi notes | 42 (45) bytes |
Shader | 446 (2124) bytes |
Total | 1023 (2861) bytes |
Total: 246 bytes for overhead + the hash for the dynamic loader, 777 bytes for content + hashes
OpenGL
Working with OpenGL on OS X is not the same as on Windows: you can’t do anything outside of the spec without it breaking down completely. The following examples show how to make a simple fragment-shader based intro in either the Legacy profile or the Core profile.
Notably, both examples start with a CGL setup for the GL context and use CGSGetKeys + glSwapAPPLE, since these are the most efficient way of getting things done. Other possibilities for setting up GL are using Cocoa directly (NSOpenGLView) or legacy AGL; in practice, however, the CGL way seems to be the shortest.
When using legacy OpenGL the version header can be omitted, but in many cases “#version 120” is beneficial since it allows automatic conversions in the code, making the GLSL coding much less WebGL-like. In a legacy shader the color is a simple way to get a (clamped) uniform-style input without declaring a real uniform. The other option is a texture coordinate: the native code length is the same but the shader code is a bit longer (better if an unclamped value is needed).
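For illustration, a hedged sketch of the color variant; these lines are an assumption layered on the example below (which uses the texture-coordinate option), and the value arrives in the fragment shader already clamped to [0,1]:
// Native side, replacing the glTexCoord1f(time) call in the loop below:
glColor3f(time,0,0);
// Shader side, replacing gl_TexCoord[0].x:
//   float time=gl_Color.r;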
Legacy profile minimum shader-intro (743 bytes with laturi):
#include <OpenGL/OpenGL.h>
#include <ApplicationServices/ApplicationServices.h>
#include <Carbon/Carbon.h>
#include <OpenGL/glu.h>
#include <OpenGL/CGLTypes.h>
#include <OpenGL/CGLContext.h>
extern void CGSGetKeys(KeyMap k);
static const GLchar *fragment_shader=
    "void main()"
    "{"
    "float time=gl_TexCoord[0].x;"
    "vec2 pos=gl_FragCoord.xy;"
    "vec4 c=vec4(.2);"
    "if (length(mod(pos*.001,.25)-.125)<mod(time*.01,.25)) c+=.5;"
    "gl_FragColor=c;"
    "}";
void main(void)
{
    CGDisplayHideCursor(0); // parameter not used by Quartz
    static const CGLPixelFormatAttribute attribs[]={0}; // anything goes.
    CGLPixelFormatObj formats;
    GLint num_pix;
    CGLChoosePixelFormat(attribs,&formats,&num_pix);
    CGLContextObj ctx;
    CGLCreateContext(formats,0,&ctx); // first hit is good enough for us.
    CGLSetCurrentContext(ctx);
    CGLSetFullScreenOnDisplay(ctx,CGDisplayIDToOpenGLDisplayMask(0));
    GLuint shader=glCreateShader(GL_FRAGMENT_SHADER); // now shader
    glShaderSource(shader,1,&fragment_shader,0);
    glCompileShader(shader);
    GLuint program=glCreateProgram();
    glAttachShader(program,shader);
    glLinkProgram(program);
    glUseProgram(program);
    float time=0;
    for (;;)
    {
        glTexCoord1f(time); // set shader timing, should come from music
        time+=.16;
        glRecti(-1,-1,1,1); // actual drawing
        glSwapAPPLE();
        KeyMap keys;
        CGSGetKeys(keys);
        if (((unsigned char*)keys)[6]&0x20) break; // Escape (keycode 53)
    }
}
When using the Core profile on OS X, all of the legacy calls (like glRecti) are unavailable. OS X also mandates a proper VAO, and both a vertex and a fragment shader must exist. By using the good old (Nvidia) trick of generating the coordinates in the vertex shader, using gl_VertexID as the animation counter and using separable shaders, we can squeeze a bit out of the code.
Unfortunately, when using separable shaders gl_Position needs to be declared in the vertex shader (the gl_PerVertex block in the example), which tips the size balance a bit unfavourably compared to the legacy pipeline.
In any case the difference is quite small, and which one is better really depends on the context.
Core profile minimum shader-intro (771 bytes with laturi):
#include <OpenGL/OpenGL.h>
#include <OpenGL/gl3.h>
#include <ApplicationServices/ApplicationServices.h>
#include <Carbon/Carbon.h>
#include <OpenGL/glu.h>
#include <OpenGL/CGLTypes.h>
#include <OpenGL/CGLContext.h>
extern void CGSGetKeys(KeyMap k);
const GLchar* vertex_shader =
    "#version 330\n"
    "out gl_PerVertex{vec4 gl_Position;};"
    "void main(){"
    "gl_Position=vec4(gl_VertexID%3>>1,gl_VertexID%3&1,gl_VertexID/3*.001,.5)-.25;"
    "}";
static const GLchar *fragment_shader=
    "#version 330\n"
    "out vec4 g;"
    "void main()"
    "{"
    "float time=gl_FragCoord.z*100;"
    "vec2 pos=gl_FragCoord.xy;"
    "vec4 c=vec4(.2);"
    "if (length(mod(pos*.001,.25)-.125)<mod(time*.01,.25)) c+=.5;"
    "g=c;"
    "}";
void main(void)
{
    CGDisplayHideCursor(0); // parameter not used by Quartz
    static const CGLPixelFormatAttribute attribs[]={kCGLPFAOpenGLProfile,kCGLOGLPVersion_3_2_Core,0};
    CGLPixelFormatObj formats;
    GLint num_pix;
    CGLChoosePixelFormat(attribs,&formats,&num_pix);
    CGLContextObj ctx;
    CGLCreateContext(formats,0,&ctx); // first hit is good enough for us.
    CGLSetCurrentContext(ctx);
    CGLSetFullScreenOnDisplay(ctx,CGDisplayIDToOpenGLDisplayMask(0));
    GLuint fprogram=glCreateShaderProgramv(GL_FRAGMENT_SHADER,1,&fragment_shader);
    GLuint vprogram=glCreateShaderProgramv(GL_VERTEX_SHADER,1,&vertex_shader);
    GLuint pipeline;
    glGenProgramPipelines(1,&pipeline);
    glUseProgramStages(pipeline,GL_FRAGMENT_SHADER_BIT,fprogram);
    glUseProgramStages(pipeline,GL_VERTEX_SHADER_BIT,vprogram);
    glBindProgramPipeline(pipeline);
    GLuint vao;
    glGenVertexArrays(1,&vao); // Core profile refuses to draw without a VAO
    glBindVertexArray(vao);
    int frame=0;
    for (;;)
    {
        frame++;
        glDrawArrays(GL_TRIANGLES,frame*3,3); // 'first' offsets gl_VertexID, acting as the animation counter
        glSwapAPPLE();
        KeyMap keys;
        CGSGetKeys(keys);
        if (((unsigned char*)keys)[6]&0x20) break; // Escape (keycode 53)
    }
}
Audio
Tools
Examples