Hello ladies and gentlemen, royal readers of my blog!
No more jokes: I wrote this post in English because I need to practice other languages (to study). Be warned and prepare your eyes; this will be a hard experience, since my English is not very good.
Last week I was looking into search algorithms, trying to gain some performance in my private projects, and I came across “SSE4.2”. When I saw the possibility of using “xmm0” (a 128-bit register), I thought “oh my god! I wanna use it! This is awesome!”. After a few days studying the concepts around SSE4.2 with my friend João Victorino, aka “Pl4kt0n”, I ended up writing a program.
Relax your brows! There is no karate trick here!
To explain: I wrote two test functions, one using the plain “strcmp()” and the other using my implementation with SSE4.2 in Assembly. I switched from AT&T to Intel syntax (AT&T is very boring) because I find it easier to follow the examples in Intel's manual. I also tested my “strcmp()” function against an array of words and collected results such as CPU cycles for the benchmark. That gives us something to compare, and the numbers can be viewed as a simple bar plot made with “gnuplot”.
You can view the result here, and the gnuplot commands here!
OK Cooler_, what's the trick?
There is no trick. Generic conditions give common results, so I followed another way to find an uncommon result…
This code has no trick. I use the “pcmpistri” instruction (Packed Compare Implicit Length Strings, Return Index), and the “movdqu” instruction (Move Unaligned Double Quadword) is used to load the string data into an XMM register. With these instructions you can do many things with strings. Take a look at the following:
global strcmp_sse42_64
; by Cooler_ c00f3r[at]gmail[dot]com
; 64 bit
; nasm -f elf64 code.s -o code.o
; int strcmp_sse42_64(const char *, const char *); // declare in C code
strcmp_sse42_64:
    push rbp
    mov rbp, rsp
    mov rax, rdi             ; rax = first string (1st argument, System V ABI)
    mov rdx, rsi             ; rdx = second string (2nd argument)
    sub rax, rdx             ; rax = distance between the two strings
    sub rdx, 16              ; pre-decrement, the loop adds it back
strloop_64:
    add rdx, 16              ; an XMM register holds 16 bytes, so step by 16
    movdqu xmm0, [rdx]
    pcmpistri xmm0, [rdx+rax], 0011000b ; compare... jump again if above...
    ja strloop_64
    jc blockmov_64           ; jump to the movzx block
    xor rax, rax             ; clear return result...
    jmp quit
blockmov_64:
    add rax, rdx
    movzx rax, byte[rax+rcx] ; move with zero
    movzx rdx, byte[rdx+rcx]
    sub rax, rdx
quit:
    pop rbp
    ret

So I use a function pointer to hook the 32-bit or the 64-bit version:
#include <stdint.h> /* UINTPTR_MAX */

#if UINTPTR_MAX == 0xffffffff
static int (*strcmp_sse42)(const char *, const char *) = strcmp_sse42_32;
#elif UINTPTR_MAX == 0xffffffffffffffff
static int (*strcmp_sse42)(const char *, const char *) = strcmp_sse42_64;
#else
#error "error in arch"
#endif
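To make the “ja”/“jc” logic of the assembly routine above easier to follow, here is a minimal scalar sketch in C of what “pcmpistri” with the immediate 0011000b (unsigned bytes, “equal each”, negative polarity, least significant index) computes for one 16-byte block, based on my reading of Intel's manual. The names block_result and pcmpistri_model are hypothetical and exist only for this illustration; it is a model for reading, not a replacement for the instruction.

```c
/* Hypothetical names, for illustration only (not from the repository). */
typedef struct {
    int index; /* what pcmpistri leaves in ECX: first differing byte, or 16 */
    int cf;    /* carry flag: 1 if some byte position differs               */
    int zf;    /* zero flag: 1 if the second operand contains a '\0'        */
} block_result;

/* Scalar model of: pcmpistri xmm(a), [b], 0011000b */
static block_result pcmpistri_model(const unsigned char a[16],
                                    const unsigned char b[16])
{
    block_result r = { 16, 0, 0 };
    int a_valid = 1, b_valid = 1; /* bytes at or after a '\0' are "invalid" */
    int i;

    for (i = 0; i < 16; i++) {
        int equal;

        if (a[i] == 0) a_valid = 0;
        if (b[i] == 0) { b_valid = 0; r.zf = 1; }

        if (!a_valid && !b_valid)      equal = 1; /* both strings already ended */
        else if (!a_valid || !b_valid) equal = 0; /* only one string ended      */
        else                           equal = (a[i] == b[i]);

        /* Negative polarity: the instruction reports the first NON-equal byte. */
        if (!equal && !r.cf) { r.cf = 1; r.index = i; }
    }
    return r;
}
```

Read with this model, “ja” (CF=0 and ZF=0) means “no difference and no terminator yet, so load the next 16 bytes”, “jc” (CF=1) means “a difference was found at index”, and falling through both jumps means the strings are equal.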
Before hooking it, you need to check whether or not your machine has SSE4.2 support. There are many ways of doing it; for the sake of simplicity, let’s go with the following one:
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Runs CPUID for leaf 'info'. EBX/RBX is swapped with EDI/RDI around the
   instruction so the value the compiler keeps in EBX/RBX is preserved. */
void cpu_get(int* cpuinfo, int info)
{
#if UINTPTR_MAX == 0xffffffff
    __asm__ __volatile__(
        "xchg %%ebx, %%edi;"
        "cpuid;"
        "xchg %%ebx, %%edi;"
        :"=a" (cpuinfo[0]), "=D" (cpuinfo[1]), "=c" (cpuinfo[2]), "=d" (cpuinfo[3])
        :"0" (info)
    );
#elif UINTPTR_MAX == 0xffffffffffffffff
    __asm__ __volatile__(
        "xchg %%rbx, %%rdi;"
        "cpuid;"
        "xchg %%rbx, %%rdi;"
        :"=a" (cpuinfo[0]), "=D" (cpuinfo[1]), "=c" (cpuinfo[2]), "=d" (cpuinfo[3])
        :"0" (info)
    );
#endif
}

void test_sse42_enable()
{
    int cpuinfo[4];
    int sse42=0;

    cpu_get(cpuinfo,1);
    sse42 = (cpuinfo[2] & (1 << 20)) != 0; /* CPUID leaf 1, ECX bit 20 = SSE4.2 */

    if(sse42)
        puts("SSE4.2 Test...\n OK SSE 4.2 instruction enable !\n");
    else {
        puts("SSE4.2 Not enabled\n your CPU need SSE 4.2 instruction to run this programm\n");
        exit(0);
    }
}
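As one of the other ways of doing the same check, GCC and Clang ship a <cpuid.h> header with a __get_cpuid() helper, so the inline assembly can be avoided. The following is only a minimal sketch, assuming one of those compilers. The names has_sse42, cmp_fn and pick_strcmp are hypothetical, and falling back to the C library's strcmp() instead of exiting is a choice made for this example, not what the program above does.

```c
#include <string.h>

#if defined(__i386__) || defined(__x86_64__)
#include <cpuid.h> /* GCC/Clang helper header; not available on MSVC */
#endif

/* Hypothetical helper, not from the repository.
   Returns 1 when CPUID leaf 1 reports SSE4.2 (ECX bit 20), otherwise 0. */
static int has_sse42(void)
{
#if defined(__i386__) || defined(__x86_64__)
    unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;

    if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
        return 0; /* CPUID leaf 1 is not supported */
    return (int)((ecx >> 20) & 1u);
#else
    return 0; /* not built for x86 with GCC/Clang, so assume no SSE4.2 */
#endif
}

typedef int (*cmp_fn)(const char *, const char *);

/* Picks the comparison routine once: 'fast' when SSE4.2 is present,
   otherwise the plain strcmp() from the C library. */
static cmp_fn pick_strcmp(cmp_fn fast)
{
    return (fast != NULL && has_sse42()) ? fast : strcmp;
}
```

With this, a call such as pick_strcmp(strcmp_sse42_64) could be stored in the function pointer, and the program keeps working, only slower, on a CPU without SSE4.2.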
See the full source code here!
$ git clone https://github.com/CoolerVoid/cooler_sse42_strcmp
$ make; ./test
SSE4.2 Test…
 OK SSE 4.2 instruction enable !
::: strcmp() with SSE42: 2812 cicles
Array size of words is: 245
Benchmark strcmp() with SSE42 matchs is: 84
::: simple strcmp(): 12663 cicles
Array size of words is: 245
Benchmark strcmp() matchs is: 84
$ cat /proc/cpuinfo | grep "model name"
model name : Intel(R) Core(TM) i5-4690K CPU @ 3.50GHz
$ gcc -v | grep "gcc version"
gcc version 4.8.3 20140911 (Red Hat 4.8.3-7) (GCC)
$ uname -a
Linux localhost.localdomain 3.15.10-201.fc20.i686 #1 SMP Wed Aug 27 21:33:30 UTC 2014 i686 i686 i386 GNU/Linux
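For readers who want to reproduce this kind of measurement, here is a minimal sketch of a harness that counts matches over an array of words and reports elapsed ticks. It is not the benchmark code from the repository: the names read_ticks, cmp_fn and count_matches are hypothetical, and it assumes GCC or Clang, where __rdtsc() comes from <x86intrin.h>. On other targets it falls back to clock(), which measures CPU time rather than cycles.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h> /* __rdtsc() on GCC/Clang */
#endif

typedef int (*cmp_fn)(const char *, const char *);

/* Reads a tick counter: the time stamp counter on x86,
   clock() elsewhere (a portability assumption of this sketch). */
static uint64_t read_ticks(void)
{
#if defined(__i386__) || defined(__x86_64__)
    return (uint64_t)__rdtsc();
#else
    return (uint64_t)clock();
#endif
}

/* Compares 'needle' against every word, stores the elapsed ticks in
   '*ticks' and returns how many words matched. */
static size_t count_matches(cmp_fn cmp, const char *const *words, size_t n,
                            const char *needle, uint64_t *ticks)
{
    size_t matches = 0;
    size_t i;
    uint64_t start = read_ticks();

    for (i = 0; i < n; i++) {
        if (cmp(words[i], needle) == 0)
            matches++;
    }
    *ticks = read_ticks() - start;
    return matches;
}
```

Calling count_matches() once with strcmp and once with the SSE4.2 routine on the same word list gives two tick counts to compare. A single run is noisy, so repeating the measurement and keeping the lowest value is usually a safer basis for a plot.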
SSE is very common in image processing, and game developers use it too. Take a look at the following:
https://software.intel.com/en-us/articles/using-intel-streaming-simd-extensions-and-intel-integrated-performance-primitives-to-accelerate-algorithms
Do you like CPU features? Look at this
well well well a cup of Moloko to my little Droogies
my fifty cents ! CHEERS !
Impressive improvement in execution speed! It seems that I should take a look at SSE.
BTW, it's only my personal opinion, but I find the post a little bit hard to read. Maybe the background could be less distracting.
Thanks for the feedback... look at this: https://software.intel.com/en-us/articles/introduction-to-intel-advanced-vector-extensions
I changed the background of the blog template, try reading it again...
cheers
Yeah, it's better now :) Thanks for the article!
Did you try your function on Windows? I've just recompiled it with FASM and it returns incorrect results.
Try with NASM https://github.com/CoolerVoid/cooler_sse42_strcmp/blob/master/Makefile#L11 and change some compiler parameters to run on Windows...
Yes, I'm able to compile it with NASM on Windows (I just need to add "section .text" after the "global.." declaration), but I get the same invalid results for any input parameter.
I've found the issue - the VS x64 calling convention uses the RCX and RDX registers for parameters, but the sse* function uses RSI & RDI....