Wednesday, June 16, 2010

Cold boot attack

How safe and secure is your data when stored on a PC/workstation or a laptop? Say you also have whole disc/volume encryption, maybe activated by a USB key. Even then, how secure is your data?

If the computer is turned on when it physically falls in hands of the attacker, you might be at risk even with whole disc encryption enabled and the screen locked requiring your (non compromised) password. This is especially true for portable machines like laptops and net-books. There are many possible attacks, we will look into a specific one called cold boot attack in this post.

Something about memory

Chances are that your main memory is DRAM technology based, which includes popular ones such as SDRAM, DDRx and even the VRAM in most cases. As it is widely considered to be volatile type of memory, we are under the impression that it loses the stored contents after power is cut off. Sadly the contents are not lost instantaneously, but they decay/degrade gradually due to the way DRAM stores data.

Figure 0 : A DRAM Cell
Without going into the details of the dynamic logic, it suffices to say that the bit is stored as charge on a capacitor connected to the rest of the circuit with a transistor which acts like a switch. Simply put, if the capacitor has charge above a certain level, we interpret bit value 1, and 0 otherwise. (each CAS or bit line like the one shown in Figure 0 is actually a set of two bit lines, +ve and -ve which are connected to alternate cells/FETs in the row. When reading, after both bit lines have been precharged, RAS is activated and the sense amplifier kicks in. Due to positive feedback it can detect the slight charge difference between +ve and -ve bit lines.)

Capacitors are not perfect. They lose charge over time as leakage current. Thus, each DRAM cell has to be `refreshed' frequently enough to retain content. Even though manufacturers usually recommend the DRAM be refreshed at least every 64ms (to ensure that there is no data corruption), well built DRAMs have a retention period that is significantly greater than the suggested refresh period. It goes without saying that some bits do corrupt, but the rate at which they do is pretty low when you put 64ms in perspective. This is true even after turning the power off entirely.

How does this concern me?

Well, the significantly long retention period of main memory is a security hole. If a computer is physically compromised when it is powered on, the attacker can read most of the contents stored on your main memory. The main memory includes wealth of information : any user names and passwords that are `cached' by applications, encryption keys for hard drive/volume encryption as well as SSL private keys for active connections, and (maybe partial) contents of any file that is or was open in recent past (as after closing the file, most operating systems do not clear the corresponding memory pages used as block cache, for obvious performance reasons), including portions of deleted files at times as well. This is assuming the machine is locked in some sense, and the unlocking password is not known to the attacker.

At room temperature, the DRAM holds on to the stored bits if the power down/up cycle completes in 64ms. Errors are introduced as powered-off period increases. The period during which no or very few bits corrupt can be significantly increased if the DRAM modules are cooled to near or beyond 0 degree C. Such `cold' modules can then be inserted into another compatible machine and their contents dumped for detailed analysis. I believe that the reason for extended retention is because as the capacitors usually have Aluminum oxide as dielectric whose leakage current reduces as temperature reduces.

A very good paper from Princeton University discussing how to detect keys in the dumps is here. They claim to have broken almost every hard drive encryption solution there is :(. They also provide software to dump memory contents on a portable drive.

This technique is important for forensics/law enforcement as well. During a bust of a computer crime suspect, memory dumps of the machines are extremely important to support the case. This technique allows the law enforcers to get the system state or a snapshot as it was at the time of the bust even without the suspects' co-operation.

Note that the technique is ineffective against other types of memory, for example bistable latched SRAM. So in case you are wondering if you could get something out of the hard drive caches or something from the buffers of a compromised router, you are out of luck!

Does it really work? Can I see it working on my machine?

Sure!

I've created a simple RAM browser (kernel) named `RAMBO' (download source) which, as the name says, can be used to browse the RAM contents after a simulated theft and cold reboot. You can choose to be super-realistic, pulling out the cord of your desktop or pulling out the battery of your laptop and putting it back in a jiffy. If you are worried about pulling the cord while windows/*NIX runs, you can choose to load a file to certain RAM location, reboot and check if the contents are readable after reboot. The program is to be loaded from a multiboot compliant bootloader such as GRUB. It does not touch any other piece of hardware than the processor, memory and keyboard, so rest assured that you will not have a corrupt disc or fried electronics. In case you do not have such a loader installed already, you can install the loader on a floppy or a USB stick and copy the program to it so that you can load it when you boot from USB/floppy. The README provides more info on how to use it.

Tuesday, June 15, 2010

Real mode in C with gcc : writing a bootloader

Usually the x86 boot loader is written in assembler. We will be exploring the possibility of writing one in C language (as much as possible) compiled with gcc, and runs in real mode. Note that you can also use the 16 bit bcc or TurboC compiler, but we will be focusing on gcc in this post. Most open source kernels are compiled with gcc, and it makes sense to write C bootloader with gcc instead of bcc as you get a much cleaner toolchain :)

As of today (20100614), gcc 4.4.4 officially only emits code for protected/long mode and does not support the real mode natively (this may change in future).

Also note that we will not discuss the very fundamentals of booting. This article is fairly advanced and assumes that you know what it takes to write a simple boot-loader in assembler. It is also expected that you know how to write gcc inline assembly. Not everything can be done in C!

getting the tool-chain working


.code16gcc

As we will be running in 16 bit real mode, this tells gas that the assembler was generated by gcc and is intended to be run in real mode. With this directive, gas automatically adds addr32 prefix wherever required. For each C file which contains code to be run in real mode, this directive should be present at the top of effectively generated assembler code. This can be ensured by defining in a header and including it before any other.

#ifndef _CODE16GCC_H_
#define _CODE16GCC_H_
__asm__(".code16gcc\n");
#endif

This is great for bootloaders as well as parts of kernel that must run in real mode but are desired written in C instead of asm. In my opinion C code is a lot easier to debug and maintain than asm code, at expense of code size and performance at times.

Special linking


As bootloader is supposed to run at physical 0x7C00, we need to tell that to linker. The mbr/vbr should end with the proper boot signature 0xaa55.
All this can be taken care of by a simple linker script.

ENTRY(main);
SECTIONS
{
    . = 0x7C00;
    .text : AT(0x7C00)
    {
        _text = .;
        *(.text);
        _text_end = .;
    }
    .data :
    {
        _data = .;
        *(.bss);
        *(.bss*);
        *(.data);
        *(.rodata*);
        *(COMMON)
        _data_end = .;
    }
    .sig : AT(0x7DFE)
    {
        SHORT(0xaa55);
    }
    /DISCARD/ :
    {
        *(.note*);
        *(.iplt*);
        *(.igot*);
        *(.rel*);
        *(.comment);
/* add any unwanted sections spewed out by your version of gcc and flags here */
    }
}

gcc emits elf binaries with sections, whereas a bootloader is a monolithic plain binary with no sections. Conversion from elf to binary can be done as follows:

$ objcopy -O binary vbr.elf vbr.bin

The code

With the toolchain set up, we can start writing our hello world bootloader!
vbr.c (the only source file) looks something like this:

/*
 * A simple bootloader skeleton for x86, using gcc.
 *
 * Prashant Borole (boroleprashant at Google mail)
 * */

/* XXX these must be at top */
#include "code16gcc.h"
__asm__ ("jmpl  $0, $main\n");


#define __NOINLINE  __attribute__((noinline))
#define __REGPARM   __attribute__ ((regparm(3)))
#define __NORETURN  __attribute__((noreturn))

/* BIOS interrupts must be done with inline assembly */
void    __NOINLINE __REGPARM print(const char   *s){
        while(*s){
                __asm__ __volatile__ ("int  $0x10" : : "a"(0x0E00 | *s), "b"(7));
                s++;
        }
}
/* and for everything else you can use C! Be it traversing the filesystem, or verifying the kernel image etc.*/

void __NORETURN main(){
    print("woo hoo!\r\n:)");
    while(1);
}


compile it as

$ gcc -c -g -Os -march=i686 -ffreestanding -Wall -Werror -I. -o vbr.o vbr.c
$ ld -static -Tlinker.ld -nostdlib --nmagic -o vbr.elf vbr.o
$ objcopy -O binary vbr.elf vbr.bin

and that should have created vbr.elf file (which you can use as a symbols file with gdb for source level debugging the vbr with gdbstub and qemu/bochs) as well as 512 byte vbr.bin. To test it, first create a dummy 1.44M floppy image, and overwrite it's mbr by vbr.bin with dd.

$ dd if=/dev/zero of=floppy.img bs=1024 count=1440
$ dd if=vbr.bin of=floppy.img bs=1 count=512 conv=notrunc

and now we are ready to test it out :D

$ qemu -fda floppy.img -boot a

and you should see the message!

Once you get to this stage, you are pretty much set with respect to the tooling itself. Now you can go ahead and write code to read the filesystem, search for next stage or kernel and pass control to it.

Here is a simple example of a floppy boot record with no filesystem, and the next stage or kernel written to the floppy immediately after the boot record. The next image LMA and entry are fixed in a bunch of macros. It simply reads the image starting one sector after boot record and passes control to it. There are many obvious holes, which I left open for sake of brevity.

/*
 * A simple bootloader skeleton for x86, using gcc.
 *
 * Prashant Borole (boroleprashant at Google mail)
 * */

/* XXX these must be at top */
#include "code16gcc.h"
__asm__ ("jmpl  $0, $main\n");


#define __NOINLINE  __attribute__((noinline))
#define __REGPARM   __attribute__ ((regparm(3)))
#define __PACKED    __attribute__((packed))
#define __NORETURN  __attribute__((noreturn))

#define IMAGE_SIZE  8192
#define BLOCK_SIZE  512
#define IMAGE_LMA   0x8000
#define IMAGE_ENTRY 0x800c

/* BIOS interrupts must be done with inline assembly */
void    __NOINLINE __REGPARM print(const char   *s){
        while(*s){
                __asm__ __volatile__ ("int  $0x10" : : "a"(0x0E00 | *s), "b"(7));
                s++;
        }
}

#if 0
/* use this for the HD/USB/Optical boot sector */
typedef struct __PACKED TAGaddress_packet_t{
    char                size;
    char                :8;
    unsigned short      blocks;
    unsigned short      buffer_offset;
    unsigned short      buffer_segment;
    unsigned long long  lba;
    unsigned long long  flat_buffer;
}address_packet_t ;

int __REGPARM lba_read(const void   *buffer, unsigned int   lba, unsigned short blocks, unsigned char   bios_drive){
        int i;
        unsigned short  failed = 0;
        address_packet_t    packet = {.size = sizeof(address_packet_t), .blocks = blocks, .buffer_offset = 0xFFFF, .buffer_segment = 0xFFFF, .lba = lba, .flat_buffer = (unsigned long)buffer};
        for(i = 0; i < 3; i++){
                packet.blocks = blocks;
                __asm__ __volatile__ (
                                "movw   $0, %0\n"
                                "int    $0x13\n"
                                "setcb  %0\n"
                                :"=m"(failed) : "a"(0x4200), "d"(bios_drive), "S"(&packet) : "cc" );
                /* do something with the error_code */
                if(!failed)
                        break;
        }
        return failed;
}
#else
/* use for floppy, or as a fallback */
typedef struct {
        unsigned char   spt;
        unsigned char   numh;
}drive_params_t;

int __REGPARM __NOINLINE get_drive_params(drive_params_t    *p, unsigned char   bios_drive){
        unsigned short  failed = 0;
        unsigned short  tmp1, tmp2;
        __asm__ __volatile__
            (
             "movw  $0, %0\n"
             "int   $0x13\n"
             "setcb %0\n"
             : "=m"(failed), "=c"(tmp1), "=d"(tmp2)
             : "a"(0x0800), "d"(bios_drive), "D"(0)
             : "cc", "bx"
            );
        if(failed)
                return failed;
        p->spt = tmp1 & 0x3F;
        p->numh = tmp2 >> 8;
        return failed;
}

int __REGPARM __NOINLINE lba_read(const void    *buffer, unsigned int   lba, unsigned char  blocks, unsigned char   bios_drive, drive_params_t  *p){
        unsigned char   c, h, s;
        c = lba / (p->numh * p->spt);
        unsigned short  t = lba % (p->numh * p->spt);
        h = t / p->spt;
        s = (t % p->spt) + 1;
        unsigned char   failed = 0;
        unsigned char   num_blocks_transferred = 0;
        __asm__ __volatile__
            (
             "movw  $0, %0\n"
             "int   $0x13\n"
             "setcb %0"
             : "=m"(failed), "=a"(num_blocks_transferred)
             : "a"(0x0200 | blocks), "c"((s << 8) | s), "d"((h << 8) | bios_drive), "b"(buffer)
            );
        return failed || (num_blocks_transferred != blocks);
}
#endif

/* and for everything else you can use C! Be it traversing the filesystem, or verifying the kernel image etc.*/

void __NORETURN main(){
        unsigned char   bios_drive = 0;
        __asm__ __volatile__("movb  %%dl, %0" : "=r"(bios_drive));      /* the BIOS drive number of the device we booted from is passed in dl register */

        drive_params_t  p = {};
        get_drive_params(&p, bios_drive);

        void    *buff = (void*)IMAGE_LMA;
        unsigned short  num_blocks = ((IMAGE_SIZE / BLOCK_SIZE) + (IMAGE_SIZE % BLOCK_SIZE == 0 ? 0 : 1));
        if(lba_read(buff, 1, num_blocks, bios_drive, &p) != 0){
            print("read error :(\r\n");
            while(1);
        }
        print("Running next image...\r\n");
        void*   e = (void*)IMAGE_ENTRY;
        __asm__ __volatile__("" : : "d"(bios_drive));
        goto    *e;
}


removing __NOINLINE may result in even smaller code in this case. I had it in place so that I could figure out what was happening.

Concluding remarks

C in no way matches the code size and performance of hand tuned size/speed optimized assembler. Also, because of an extra byte (0x66, 0x67) wasted (in addr32) with almost every instruction, it is highly unlikely that you can cram up the same amount of functionality as assembler.

Global and static variables, initialized as well as uninitialized, can quickly fill those precious 446 bytes. Changing them to local and passing around instead may increase or decrease size; there is no thumb rule and it has to be worked out on per case basis. Same goes for function in-lining.

You also need to be extremely careful with various gcc optimization flags. For example, if you have a loop in your code whose number of iterations are small and deducible at compile time, and the loop body is relatively small (even 20 bytes), with default -Os, gcc will unroll that loop. If the loop is not unrolled (-fno-tree-loop-optimize), you might be able to shave off big chunk of bytes there. Same holds true for frame setups on i386 - you may want to get rid of them whenever not required using -fomit-frame-pointer. Moral of the story : you need to be extra careful with gcc flags as well as version update. This is not much of an issue for other real mode modules of the kernel where size is not of this prime importance.

Also, you must be very cautious with assembler warnings when compiling with .code16gcc. Truncation is common. It is a very good idea to use --save-temp and analyze the assembler code generated from your C and inline assembly. Always take care not to mess with the C calling convention in inline assembly and meticulously check and update the clobber list for inline assembly doing BIOS or APM calls (but you already knew it, right?).

It is likely that you want to switch to protected/long mode as early as possible, though. Even then, I still think that maintainability wins over asm's size/speed in case of a bootloader as well as the real mode portions of the kernel.

It would be interesting if someone could try this with c++/java/fortran. Please let me know if you do!