Linux is at advantage by the fact that most standard Linux distributions ship a lot more libraries than Win32 which you can assume being available in the system. Here is the list of packages that come by default with Ubuntu 19.04. Some of the useful ones are e.g. SDL which you can use as the base framework for the application giving you graphics output and event handling. SDL also provides sound output capabilities. Not all parties have SDL installed though, as it doesn’t come by default with Ubuntu! Also, usually, only the 64-bit versions of the libraries are installed.
More resources:
Linux (and most other Unix-like systems) also has the advantage (compared to
many other operating systems) that you don’t need to use any libraries at all
if you don’t want to, because everyhing can be done through kernel-level system
calls (the int
opcode in x86 assembly, or syscall
in 64-bit mode). This is
a totally different approach than the previously mentioned one.
The exact mechanics of how to make a syscall can be found in the “Whirlwind Tutorial on Creating Teensy ELF Executables”.
Audio output can be achieved easily by creating a pipe (using the pipe
syscall), fork
ing, dup2
in the read end of the pipe to stdin
and execve
ing
the child to /usr/bin/aplay
. Then, the parent process can simply write PCM
data to the pipe, and sound will appear.
Here is a prod that does just this.
The visual side is more complicated. Framebuffer devices (/dev/fd*
for pixels,
/dev/vcsa
for tiles, or direct /dev/dri/card0
access) are simple to access
for pixel rendering, but they’re not very easy to access on desktops or laptops.
Therefore, if you want to do decent visuals in a compatible way (and text
terminal output is not enough for you), you must learn how to talk the X11
protocol in a low level. Basically, you establish a connection to
localhost:6000, send the initial request, grab the id’s you require from the
response, open a window, etc. etc. It may be a bit long-winded, but you also
get a low-level access to extensions such as XVideo and GLX. However, if you
want access to the GPU (and don’t want to code your own GPU drivers), this
isn’t exactly an option as the GPU drivers need to be loaded from dynamic
libraries.
Another approach usable for 4K intros on Unix platforms is self-compiling. That is, you actually distribute the intro as an executable source code package that decompresses, compiles, executes and deletes the actual intro. A shell script stub attached to the compressed source code of an SDL-based intro could perhaps be something like this:
a=/tmp/I;zcat<<X>$a;cc `sdl-config --libs --cflags` -o $a. $a;$a.;rm $a.;exit
ctypes
library. See eg.
this prod. This way you can
avoid the overead in C coding, but you’ll have to mess with the ctypes
library bugs to make it work in 64-bit mode.I’m assuming GCC or Clang here.
First of all, you want a few obvious (and less obvious) optimization plugins:
Optimization:
-flto -fuse-linker-plugin -ffast-math
-O2/-Os (always try both, -O2 sometimes generates smaller code than -Os, but
the former also aligns branch targets to 16 bytes...)
-march=core2 <--- usually seems to generate smaller code? not in all cases?
You might also want to disable all SSE/MMX/AVX stuff, as mixing these with
'regular' x86 will make the code less compressable. (Unless those instructions
are concentrated in one big block.) However, compilers like to use those for
non-'SIMD-purposes' very much, such as copying structs around.
Avoid ELF and ABI bloat:
-fno-plt -fno-stack-protector -fno-stack-check -fno-unwind-tables
-fno-asynchronous-unwind-tables -fomit-frame-pointer -fno-pic -fno-PIE -fno-PIC
-ffunction-sections -fdata-sections
The last two switches causes every function/variable to be placed in its own
section, after which the linker can remove unused ones using the --gc-sections
flag.
Avoiding linker bloat:
-Wl,--build-id=none -z norelro -wl,-z,noseparate-code -Wl,--no-eh-frame-hdr
-no-pie -Wl,--no-ld-generated-unwind-info -Wl,--hash-style=sysv
-Wl,-z,nodynamic-undefined-weak
-Wl,--gc-sections <--- remove all unused functions/global variables/sections
Throw away C++ crap:
-fno-exceptions -fno-rtti -fno-threadsafe-statics
A second trick, useful for dealing with custom linkers while still reaping
the benefits of some of the flags of GNU ld, is incremental linking: linking
multiple .o
files into a new one that can be linked into an executable. That
way, you can still use things like LTO (which eg. smol (see below) isn’t able
to deal with). You use it by passing -Wl,-i
to GCC. For GCC 9 and later, you
also have to pass in -flinker-output=nolto-rel
in order to actually apply the
LTO and output a normal object file.
Instead of importing functions that implement syscalls (like open
) or ones
that do basic memory or math stuff (like memcpy
and sin
), it’s a better
idea to implement these yourself using inline assembly. For example, a memcpy
call can be turned into a single rep stosb
instruction using the following
snippet:
inline __attribute__((__force_inline__)) // make gcc/clang ALWAYS inline this function
static void ASM_memcpy(void* dst, const void* src, size_t size) {
asm volatile /* volatile prevents reordering of the asm code between other
C code etc. */ (
"rep stosb\n" // the asm code
: // output variables (none), in the format "=format"(C_varname)[ASM_varname]
:"S"(src),"D"(dst),"c"(size) // input variables, esi/rsi == src, edi/rdi == dst, ecx/rcx == size
: // other, destroyed registers (none)
);
}
A collection of a nubmer of syscalls implemented this way can be found here (+ dependency).
Other tricks for getting the code size down have been discussed at lenght in other articles in this wiki.
First of all, there’s the startup code. Basically, the main
function you
wrote isn’t the real first thing that gets executed. The very first code
that actually runs in the process – ignoring the dynamic linker for a moment
– can be found in the compiler’s crt*
files. Those files fetch the argc
,
argv
, environment variables and the
ELF auxiliary vector from
the kernel, align the stack, initialize the internal libc components, and only
then it jumps to main
. (For more details, visit the Linux Sizecoding Wiki.)
However, this code is rather large, and it’s easy to bypass this: -nostartfiles
.
But, if you use this flag without thinking, you’ll get another error: “function
_start
not found”. This is the real entrypoint, so just rename your main
to _start
. Then it will run, but your argc
and argv
and environ
will be
garbage. Also, you might notice random crashes in random places that use SSE
instructions.
This means, you’ll have to initialize and align the stack yourself. Fortunately,
this isn’t too hard: argc
and argv
etc. are placed on the stack at process
entry as shown in the ASCII diagram in the midddle of
this article. Then call __libc_start_main
with main
, argc
, argv
, three null pointers, and finally the initial stack
value. An example of doing this can be found
here, except fetching
the stack pointer value is left as an exercise for the reader.
(Just mov foo, esp/rsp
isn’t a solution because the compiler inserts a
function prologue.)
See the next section
As ‘vanilla’ ELF files can get quite large because of the import and relocation tables, a few tricks need to be used to decrease the file size.
This is arguably the easiest and most stable option, but isn’t the most effective, size-wise. However, ld doesn’t like to output small binaries, so you have to bully it a little bit:
.text
(code) and
.rodata
(readonly data) sections, discards .gnu.warning
, .note.*
,
.gnu_debuglink
, .comment
, .gnu.attributes
, .eh_frame*
, .gcc_except*
,
.gnu_extab*
, .exception*
, etc. Get the default linker script using -Wl,-v
,
then modify it. See the GNU ld docs for the syntax.objcopy -R .gnu.version -R .gnu.version_r
.objcopy
.
This actually disables the use of the versioning info.strip -s
. This removes symbol and debug data etc, but leaves the section
headers in. Also removes remaining versioning data now that it is no longer
used thanks to noelfver.sstrip -z
to also remove the section headers (from ELF Kickers)chmod -x
dnload is a tool by Trilkk/faemiyah that takes C/C++ and GLSL source code and turns it into a packed binary. It works out of the box, but only supports a fixed set of libraries and symbols.
smol (‘staging’ branch) is an experimental, currently-work-in-progress size-optimizing linker that is able to crunch the size of one’s binary a bit further than dnload, ELF-header-wise. It doesn’t come with batteries included, and its options can get a bit confusing, so getting it to work takes some effort.
Many other tools you might have heard about, such as bold, elfling, crinklux, …, have fell prey to bitrot and are currently unusable, as circumstances have changed: 32-bit ELFs are a dead end, as there are no libraries, and GCC is emitting different relocation types by now so that programs like Bold that have their own linking logic break as well. It’s a sad state of affairs, but it is what it is.
For details on how these things could be implemented, go to the Linux Sizecoding Wiki.
As most intro synths (4klang, Clinkster, Oidos) are self-contained, they can be used without too much effort on Linux as well. Furthermore, 64klang2 is made to work on Linux ‘by default’, and there are a number of other synths (such as ghostsyn from faemiyah) that were made especially for Linux (or other Unices).
However, for the trio of 4klang, Clinkster and Oidos, there are a few complications: they need some linking voodoo to get them to link properly, and they’re 32-bit only. Solutions for the former can be found here, and for the latter, one can use a dirty trick:
Make a standalone, static, 32-bit binary that only plays the music (as
explained in the section on library-less coding), embed that executable in the
main binary, and start that on launch. Sync the audio and video using raw
read
/write
syscalls to a pipe, as read
will block when there is no data
available.
As SDL etc. are sometimes not available, one might try to open a window using libX11, and create an OpenGL context using GLX or EGL. However, using libX11 is often quite brittle, and these three APIS are very verbose.
But, there is a library called libgtk, and it has an API to create an OpenGL context! So, we can just use that. An example for that can be found here.
However, if you only need a single fullscreen that doesn’t take any input and only has to be shown directly to the screen, you can also use libclutter, which has functions for that. See here or here.
If you’re lucky, the compo rules might allow the use of SDL, or even GLFW. Use it! It makes initialization much easier and smaller, but, it mightn’t be available everywhere, so always have a GTK or Clutter backup version.
A quick and easy way to compress your executable is to use a shell script dropper: a small shell script that unpacks and runs itself, followed directly by your compressed intro.
Using a shell dropper, you can compress your intro using only tools of which
decompressors are installed by default – gzip, xz, lzma, bzip2, etc. Gzip
(gzip -cnk9
for compression, zcat
for decompression) usually isn’t very
good – zopfli might get you slightly better compression results, but xz (xz
for compression, xzcat
for decompression, see manpage for flags) is usually
much better. However, standard xz files have large headers compared to gzip.
Therefore, it’s a better idea to use xz in LZMA1 mode (xz --lzma1
), which has
slightly worse compression, but much smaller headers. With these small sizes,
LZMA1 usually ends up performing slightly better than xz because of this.
The smallest shell dropper I’ve come across is this one, which takes 42 bytes:
cp $0 /tmp/M;(sed 1d $0|xzcat)>$_;$_;exit
<compressed data>
Of course, this assumes that /tmp
is executable. One could change the /tmp/M
to just M
to put it in the current directory, but that isn’t guaranteed to
be writable either!
However, there is a known ‘fix’ for this: using an in-memory dropper written in assembly. vondehi is currently the best of those. It basically does the same things as the shell script, except everything is performed in RAM. Simply take the output of the NASM file and append your compressed data.
If you want to tune the compression parameters to get a slightly smaller output file, there’s a script called autovndh.py that bruteforces all options, and optionally also prepends a vondehi stub as well.
This still isn’t the best compression ratio possible, however. As Crinkler on Windows performs much better because of its use of a PAQ-style algorithm (as opposed to LZMA), the tooling isn’t finished yet. Work has been done towards reaching a Crinkler equivalent, as there are some obsolete tools that do this, as well as a current work-in-progress project called XLINK by unlord/xylem. The latter isn’t usable yet, though.