Wednesday, December 12, 2012

Compiling for T4

I've recently had quite a few queries about compiling for T4 based systems. So it's probably a good time to review what I consider to be the best practices.

  • Always use the latest compiler. Being in the compiler team, this is bound to be something I'd recommend :) But the serious points are that (a) Every release the tools get better and better, so you are going to be much more effective using the latest release (b) Every release we improve the generated code, so you will see things get better (c) Old releases cannot know about new hardware.
  • Always use optimisation. You should use at least -O to get some amount of optimisation. -xO4 is typically even better as this will add within-file inlining.
  • Always generate debug information, using -g. This allows the tools to attribute information to lines of source. This is particularly important when profiling an application.
  • The default target of -xtarget=generic is often sufficient. This setting is designed to produce a binary that runs well across all supported platforms. If the binary is going to be deployed on only a subset of architectures, then it is possible to produce a binary that only uses the instructions supported on these architectures, which may lead to some performance gains. I've previously discussed which chips support which architectures, and I'd recommend that you take a look at the chart that goes with the discussion.
  • Crossfile optimisation (-xipo) can be very useful - particularly when the hot source code is distributed across multiple source files. If you're allowed to have something as geeky as favourite compiler optimisations, then this is mine!
  • Profile feedback (-xprofile=[collect: | use:]) will help the compiler make the best code layout decisions, and is particularly effective with crossfile optimisations. But what makes this optimisation really useful is that codes that are dominated by branch instructions don't typically improve much with "traditional" compiler optimisation, but often do respond well to being built with profile feedback.
  • The macro flag -fast aims to provide a one-stop "give me a fast application" flag. This usually gives a best performing binary, but with a few caveats. It assumes the build platform is also the deployment platform, it enables floating point optimisations, and it makes some relatively weak assumptions about pointer aliasing. It's worth investigating.
  • SPARC64 processors, the T3, and the T4 implement floating point multiply accumulate instructions. These can substantially improve floating point performance. To generate them the compiler needs the flag -fma=fused and also needs an architecture that supports the instruction (at least -xarch=sparcfmaf).
  • The most critical advice is that anyone doing performance work should profile their application. I cannot overstate how important it is to look at where the time is going in order to determine what can be done to improve it.

I also presented at Oracle OpenWorld on this topic, so it might be helpful to review those slides.

Wednesday, December 5, 2012

Library order is important

I've written quite extensively about link ordering issues, but I've not discussed the interaction between archive libraries and shared libraries. So let's take a simple program that calls a maths library function:

#include <math.h>

int main()
{
  for (int i=0; i<10000000; i++)
  {
    sin(i);
  }
}

We compile and run it to get the following performance:

bash-3.2$ cc -g -O fp.c -lm
bash-3.2$ timex ./a.out

real           6.06
user           6.04
sys            0.01

Now most people will have heard of the optimised maths library, which is added by the flag -xlibmopt. This contains optimised versions of key mathematical functions. In this instance, using the library doubles performance:

bash-3.2$ cc -g -O -xlibmopt fp.c -lm
bash-3.2$ timex ./a.out

real           2.70
user           2.69
sys            0.00

The optimised maths library is provided as an archive library (libmopt.a), and the driver adds it to the link line just before the maths library - this causes the linker to pick the definitions provided by the static library in preference to those provided by libm. We can see the processing by asking the compiler to print out the link line:

bash-3.2$ cc -### -g -O -xlibmopt fp.c -lm
/usr/ccs/bin/ld ... fp.o -lmopt -lm -o a.out...

The flag to the linker is -lmopt, and this is placed before the -lm flag. So what happens when the -lm flag is in the wrong place on the command line:

bash-3.2$ cc -g -O -xlibmopt -lm fp.c
bash-3.2$ timex ./a.out

real           6.02
user           6.01
sys            0.01

If the -lm flag is before the source file (or object file for that matter), we get the slower performance from the system maths library. Why's that? If we look at the link line we can see the following ordering:

/usr/ccs/bin/ld ... -lmopt -lm fp.o -o a.out 

So the optimised maths library is still placed before the system maths library, but the object file is placed afterwards. This would be ok if the optimised maths library were a shared library, but it is not - instead it's an archive library, and archive library processing is different - as described in the linker and library guide:

"The link-editor searches an archive only to resolve undefined or tentative external references that have previously been encountered."

An archive library can only be used to resolve symbols that are outstanding at that point in the link processing. When fp.o is placed before the libmopt.a archive library, the linker has an unresolved symbol referenced in fp.o, and it will search the archive library to resolve that symbol. If the archive library is placed before fp.o then there are no unresolved symbols at that point, and so the linker doesn't need to use the archive library. This is why libmopt needs to be placed after the object files on the link line.

On the other hand if the linker has observed any shared libraries, then at any point these are checked for any unresolved symbols. The consequence of this is that once the linker "sees" libm it will resolve any symbols it can to that library, and it will not check the archive library to resolve them. This is why libmopt needs to be placed before libm on the link line.

This leads to the following order for placing files on the link line:

  • Object files
  • Archive libraries
  • Shared libraries

If you use this order, then things will consistently get resolved to the archive libraries rather than to the shared libraries.
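The two rules above can be sketched as a toy model of the link-editor. This is only an illustration under simplifying assumptions: it tracks a single symbol (sin), libraries never reference it themselves, and the type and function names (link_input, who_provides_sin) are made up for this sketch rather than taken from any real linker.

```c
#include <stddef.h>

/* Kinds of input that can appear on a link line. */
typedef enum { OBJECT, ARCHIVE, SHARED } input_kind;

typedef struct {
    const char *name;           /* e.g. "fp.o", "libmopt.a", "libm.so" */
    input_kind  kind;
    int         references_sin; /* nonzero if the input calls sin() */
    int         defines_sin;    /* nonzero if the input provides sin() */
} link_input;

/*
 * Walk the link line left to right and return the name of the input
 * that ends up providing sin(), or NULL if it stays unresolved.
 * Hypothetical, simplified model: one symbol only.
 */
const char *who_provides_sin(const link_input *line, size_t n)
{
    int wanted = 0;                     /* sin() referenced but unresolved */
    const char *shared_with_sin = NULL; /* shared library already seen */

    for (size_t i = 0; i < n; i++) {
        const link_input *in = &line[i];
        switch (in->kind) {
        case OBJECT:
            if (in->references_sin) {
                /* Shared libraries seen earlier are always consulted. */
                if (shared_with_sin != NULL) return shared_with_sin;
                wanted = 1;
            }
            break;
        case ARCHIVE:
            /* An archive is searched only for references already seen. */
            if (wanted && in->defines_sin) return in->name;
            break;
        case SHARED:
            if (in->defines_sin) {
                if (wanted) return in->name;
                if (shared_with_sin == NULL) shared_with_sin = in->name;
            }
            break;
        }
    }
    return NULL;
}
```

With the inputs in the order fp.o, libmopt.a, libm.so the model picks libmopt.a. Moving fp.o to the end, or placing libm.so before libmopt.a, makes it pick libm.so, which matches the timings shown earlier.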

Tuesday, December 4, 2012

It could be worse....

As "guest" pointed out, in my file I/O test I didn't open the file with O_SYNC, so in fact the time was spent in OS code rather than in disk I/O. It's a straightforward change to add O_SYNC to the open() call, but it's also useful to reduce the iteration count - since the cost per write is much higher:

...
#define SIZE 1024

void test_write()
{
  starttime();
  int file = open("./test.dat",O_WRONLY|O_CREAT|O_SYNC,S_IWGRP|S_IWOTH|S_IWUSR);
...

Running this gave the following results:

Time per iteration   0.000065606310 MB/s
Time per iteration   2.709711563906 MB/s
Time per iteration   0.178590114758 MB/s

Yup, disk I/O is way slower than the original I/O calls. However, it's not a very fair comparison since disks get written in large blocks of data and we're deliberately sending a single byte. A fairer result would be to look at the I/O operations per second, which is about 65 - pretty much what I'd expect for this system.
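The conversion from the reported throughput to operations per second is simple arithmetic. A minimal sketch follows, assuming 1 MB means 10^6 bytes (which is what the timing code's scaling implies) and using a helper name, ops_per_second, that is invented here.

```c
/*
 * Convert a throughput in MB/s into operations per second, given the
 * payload of each operation in bytes. Assumes 1 MB = 1e6 bytes.
 */
double ops_per_second(double mb_per_second, double bytes_per_op)
{
    return mb_per_second * 1.0e6 / bytes_per_op;
}
```

Feeding in the first result above (0.000065606310 MB/s) with one byte per write gives roughly 65.6 synchronous writes per second.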

It's also interesting to look at the profiles for the two cases. When the write() was trapping into the OS the profile indicated that all the time was being spent in system. When the data was being written to disk, the time got attributed to sleep. This gives us an indication of how to interpret profiles from apps doing I/O. It's the sleep time that indicates disk activity.

Write and fprintf for file I/O

fprintf() does buffered I/O, whereas write() does unbuffered I/O. So once the write() completes, the data is in the file, whereas for fprintf() it may take a while for the file to get updated to reflect the output. This results in a significant performance difference - the write works at disk speed. The following is a program to test this:

#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/time.h>
#include <sys/types.h>
#include <sys/stat.h>

static double s_time;

void starttime()
{
  s_time=1.0*gethrtime();
}

void endtime(long its)
{
  double e_time=1.0*gethrtime();
  printf("Time per iteration %5.2f MB/s\n", (1.0*its)/(e_time-s_time*1.0)*1000);
  s_time=1.0*gethrtime();
}

#define SIZE (10*1024*1024)

void test_write()
{
  starttime();
  int file = open("./test.dat",O_WRONLY|O_CREAT,S_IWGRP|S_IWOTH|S_IWUSR);
  for (int i=0; i<SIZE; i++)
  {
    write(file,"a",1);
  }
  close(file);
  endtime(SIZE);
}

void test_fprintf()
{
  starttime();
  FILE* file = fopen("./test.dat","w");
  for (int i=0; i<SIZE; i++)
  {
    fprintf(file,"a");
  }
  fclose(file);
  endtime(SIZE);
}

void test_flush()
{
  starttime();
  FILE* file = fopen("./test.dat","w");
  for (int i=0; i<SIZE; i++)
  {
    fprintf(file,"a");
    fflush(file);
  }
  fclose(file);
  endtime(SIZE);
}


int main()
{
  test_write();
  test_fprintf();
  test_flush();
}

Compiling and running I get 0.2MB/s for write() and 6MB/s for fprintf(). A large difference. There are three tests in this example; the third test uses fprintf() and fflush(). This is equivalent to write() both in performance and in functionality. Which leads to the suggestion that fprintf() (and other buffering I/O functions) are the fastest way of writing to files, and that fflush() should be used to enforce synchronisation of the file contents.
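One way to act on that suggestion is to keep the buffered writes and only flush at chosen checkpoints. The following is a sketch under assumptions: the function name and the flush_interval parameter are hypothetical, and fflush() is taken to mean "hand the buffered data to the OS", not "force it onto the disk".

```c
#include <stdio.h>

/*
 * Write `count` single-character records to an already-open stream,
 * calling fflush() only after every `flush_interval` records.
 * Returns the number of fflush() calls made, or -1 on error.
 */
long write_with_checkpoints(FILE *out, long count, long flush_interval)
{
    long flushes = 0;

    if (out == NULL || flush_interval <= 0) {
        return -1;
    }
    for (long i = 1; i <= count; i++) {
        if (fprintf(out, "a") < 0) {
            return -1;
        }
        /* Synchronise the file contents only at checkpoints. */
        if (i % flush_interval == 0) {
            if (fflush(out) != 0) {
                return -1;
            }
            flushes++;
        }
    }
    return flushes;
}
```

Calling it with a flush_interval of 1 corresponds to the third test above, while a large interval approaches the plain fprintf() case.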

Thursday, October 18, 2012

Mixing Java and native code

This was a bit of a surprise to me. The slides are available from my presentation at JavaOne on mixed language development. What I wasn't expecting was that there would also be a video of the presentation.

Maximising SPARC T4 performance

My presentation from Oracle Open World is available for download.

Tuesday, October 9, 2012

25 years of SPARC

Looks like an interesting event at the Computer History Museum on 1 November. A panel discussing SPARC at 25: Past, Present and Future. Free sign up.

Friday, September 14, 2012

Current SPARC Architectures

Different generations of SPARC processors implement different architectures. The architecture that the compiler targets is controlled implicitly by the -xtarget flag and explicitly by the -xarch flag.

If an application targets a recent architecture, then the compiler gets to play with all the instructions that the new architecture provides. The downside is that the application won't work on older processors that don't have the new instructions. So for developers there is a trade-off between performance and portability.

The way we have solved this in the compiler is to assume a "generic" architecture, and we've made this the default behaviour of the compiler. The only flag that doesn't make this assumption is -fast which tells the compiler to assume that the build machine is also the deployment machine - so the compiler can use all the instructions that the build machine provides.

The -xtarget=generic flag tells the compiler explicitly to use this generic model. We work hard on making generic code work well across all processors. So in most cases this is a very good choice.

It is also of interest to know what processors support the various architectures. The following Venn diagram attempts to show this:


A textual description is as follows:

  • The T1 and T2 processors, in addition to most other SPARC processors that were shipped in the last 10+ years supported V9b, or sparcvis2.
  • The SPARC64 processors from Fujitsu, used in the M-series machines, added support for the floating point multiply accumulate instruction in the sparcfmaf architecture.
  • Support for this instruction also appeared in the T3 - this is called sparcvis3.
  • Later SPARC64 processors added the integer multiply accumulate instruction, this architecture is sparcima.
  • Finally the T4 includes support for both the integer and floating point multiply accumulate instructions in the sparc4 architecture.

So the conclusion should be:

  • Floating point multiply accumulate is supported in both the T-series and M-series machines, so it should be a relatively safe bet to start using it.
  • The T4 is a very good machine to deploy to because it supports all the current instruction sets.
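The textual description can also be captured as a small lookup table. This is a sketch, not an authoritative feature matrix: the struct and function names are invented here, and marking sparcima as also having the floating point instruction is an inference from the statement that later SPARC64 processors added the integer instruction on top of what earlier ones had.

```c
#include <stddef.h>
#include <string.h>

typedef struct {
    const char *arch; /* value passed to -xarch */
    int fp_mac;       /* floating point multiply accumulate */
    int int_mac;      /* integer multiply accumulate */
} arch_features;

/* Table built from the list above; sparcima's fp_mac is inferred. */
static const arch_features arch_table[] = {
    { "sparcvis2", 0, 0 },
    { "sparcfmaf", 1, 0 },
    { "sparcvis3", 1, 0 },
    { "sparcima",  1, 1 },
    { "sparc4",    1, 1 },
};

/* Return the table entry for an architecture name, or NULL if unknown. */
const arch_features *lookup_arch(const char *arch)
{
    size_t n = sizeof(arch_table) / sizeof(arch_table[0]);
    for (size_t i = 0; i < n; i++) {
        if (strcmp(arch_table[i].arch, arch) == 0) {
            return &arch_table[i];
        }
    }
    return NULL;
}

/* Nonzero if code using floating point multiply accumulate can run there. */
int can_use_fp_mac(const char *arch)
{
    const arch_features *f = lookup_arch(arch);
    return f != NULL && f->fp_mac;
}
```

Reading down the fp_mac column shows why the floating point instruction is the relatively safe bet: every entry after sparcvis2 has it.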

Thursday, August 30, 2012

SPARC Architecture 2011

With what appears to be minimal fanfare, an update of the SPARC Architecture has been released. If you ever look at SPARC disassembly code, then this is the document that you need to bookmark. If you are not familiar with it, then it basically describes how a SPARC processor should behave - it doesn't describe a particular implementation, just the "generic" processor. As with all revisions, it supersedes the SPARC v9 book published back in the 90s, having both corrections and definitions of new instructions. Anyway, should be an interesting read :)

Wednesday, August 29, 2012

CON6714 - Mixed-Language Development: Leveraging Native Code from Java

Here's the abstract from my JavaOne talk:

There are some situations in which it is necessary to call native code (C/C++ compiled code) from Java applications. This session describes how to do this efficiently and how to performance-tune the resulting applications.

The objectives for the session are:

  • Explain reasons for using native code in Java applications
  • Describe pitfalls of calling native code from Java
  • Discuss performance-tuning of Java apps that use native code

I'll cover how to call native code from Java, debugging native code, and then I'll dig into performance tuning the code. The talk is not going too deep on performance tuning - focusing on the JNI specific topics; I'll do a bit more about performance tuning in my OpenWorld talk later in the day.

Monday, August 27, 2012

Monday 1 October: Presenting at JavaOne and Oracle OpenWorld

On Monday 1 October I will be presenting at both JavaOne and Oracle Open World. The full conference schedule is available from here. The logistics for my sessions are as follows:


  • JavaOne: 8:30am Monday 1 October. CON6714: "Mixed-Language Development: Leveraging Native Code from Java". San Francisco Hilton - Continental Ballroom 6
  • Oracle OpenWorld: 10:45am Monday 1 October. CON6382: "Maximizing Your SPARC T4 Oracle Solaris Application Performance". Marriott Marquis - Golden Gate C3

Hope to see you there!

Tuesday, May 22, 2012

Square roots

If you are spending significant time calling sqrt() then to improve this you should compile with -xlibmil. Here's some example code that calls both fabs() and sqrt():

#include <math.h>
#include <stdio.h>

int main()
{
  double d=23.3;
  printf("%f\n",fabs(d));
  printf("%f\n",sqrt(d));
}

If we compile this with Studio 12.2 we will see calls to both fabs() and sqrt():

$ cc -S -O m.c
$ grep call m.s|grep -v printf
/* 0x0018            */         call    fabs    ! params =  %o0 %o1     ! Result =  %f0 %f1
/* 0x0044            */         call    sqrt    ! params =  %o0 %o1     ! Result =  %f0 %f1

If we add -xlibmil then these calls get replaced by equivalent instructions:

$ cc -S -O -xlibmil  m.c
$ grep abs m.s|grep -v print; grep sqrt m.s|grep -v print
/* 0x0018          7 */         fabsd   %f4,%f0
/* 0x0038            */         fsqrtd  %f6,%f2

The default for Studio 12.3 is to inline fabs(), but you still need to add -xlibmil for the compiler to inline sqrt(), so it is a good idea to include the flag.

You can see the functions that are replaced by inline versions by grepping the inline template file (libm.il) for the word "inline":

$ grep inline libm.il
        .inline sqrtf,1
        .inline sqrt,2
        .inline ceil,2
        .inline ceilf,1
        .inline floor,2
        .inline floorf,1
        .inline rint,2
        .inline rintf,1
        .inline min_subnormal,0
        .inline min_subnormalf,0
        .inline max_subnormal,0
...

The caveat with -xlibmil is documented:

          However, these substitutions can cause the setting of
          errno to become unreliable. If your program depends on
          the value of errno, avoid this option. See the NOTES
          section at the end of this man page for more
          information.

One optimisation in the inline versions of these functions is that they do not set errno. This can be a problem for some codes, but most codes don't read errno.

Monday, April 23, 2012

sincos()

If you are computing both the sine and cosine of an angle, then calling sincos() is about twice as quick as calling cos() and sin() independently:

#include <math.h>

int main()
{
  double a,b,c;
  a=1.0;
  for (int i=0;i<100000000;i++) { b=sin(a); c=cos(a); }
}

$ cc -O sc.c -lm
$ timex ./a.out
real          19.13

vs

#include <math.h>

int main()
{
  double a,b,c;
  a=1.0;
  for (int i=0;i<100000000;i++) { sincos(a,&b,&c); }
}
$ cc -O sc.c -lm
$ timex ./a.out
real           9.80

Friday, April 20, 2012

What is -xcode=abs44?

I've talked about building 64-bit libraries with position independent code. When building 64-bit applications there are two options for the code that the compiler generates: -xcode=abs64 or -xcode=abs44. The default is -xcode=abs44. These are documented in the user guides. The abs44 and abs64 options produce 64-bit applications that constrain the code + data + BSS to either 44 bits or 64 bits of address.

These options constrain the addresses statically encoded in the application to either 44 or 64 bits. They do not restrict the address range for pointers (dynamically allocated memory) - these remain 64 bits. The restriction is in locating the address of a routine or a variable within the executable.

This is easier to understand from the perspective of an example. Suppose we have a variable "data" that we want to return the address of. Here's the code to do such a thing:

extern int data;

int * address()
{
  return &data;
}

If we compile this as a 32-bit app we get the following disassembly:

/* 000000          4 */         sethi   %hi(data),%o5
/* 0x0004            */         retl    ! Result =  %o0
/* 0x0008            */         add     %o5,%lo(data),%o0

So it takes two instructions to generate the address of the variable "data". At link time the linker will go through the code, locate references to the variable "data" and replace them with the actual address of the variable, so these two instructions will get modified. If we compile this as a 64-bit code with full 64-bit address generation (-xcode=abs64) we get the following:

/* 000000          4 */         sethi   %hh(data),%o5
/* 0x0004            */         sethi   %lm(data),%o2
/* 0x0008            */         or      %o5,%hm(data),%o4
/* 0x000c            */         sllx    %o4,32,%o3
/* 0x0010            */         or      %o3,%o2,%o1
/* 0x0014            */         retl    ! Result =  %o0
/* 0x0018            */         add     %o1,%lo(data),%o0

So to do the same thing for a 64-bit application with full 64-bit address generation takes 6 instructions. Now, most hardware cannot address the full 64 bits; hardware typically can address somewhere around 40+ bits of address (example). So being able to generate a full 64-bit address is currently unnecessary. This is where abs44 comes in. A 44-bit address can be generated in four instructions, which cuts the instruction count without practically compromising the range of memory that an application can address:

/* 000000          4 */         sethi   %h44(data),%o5
/* 0x0004            */         or      %o5,%m44(data),%o4
/* 0x0008            */         sllx    %o4,12,%o3
/* 0x000c            */         retl    ! Result =  %o0
/* 0x0010            */         add     %o3,%l44(data),%o0
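To make the four-instruction sequence concrete, here is a sketch that mimics it in C. It rests on an assumption about how the relocation operators split the address: %h44 is taken to be bits 43..22, %m44 bits 21..12, and %l44 bits 11..0, with sethi placing its 22-bit operand in bits 31..10 of the register. The function names are invented for this sketch.

```c
#include <stdint.h>

/* Assumed split of a 44-bit address into the three relocation fields. */
uint64_t h44(uint64_t addr) { return (addr >> 22) & 0x3fffff; } /* 22 bits */
uint64_t m44(uint64_t addr) { return (addr >> 12) & 0x3ff; }    /* 10 bits */
uint64_t l44(uint64_t addr) { return addr & 0xfff; }            /* 12 bits */

/* Rebuild the address the way the abs44 instruction sequence does. */
uint64_t build_abs44(uint64_t addr)
{
    uint64_t o5 = h44(addr) << 10; /* sethi %h44(data),%o5     */
    uint64_t o4 = o5 | m44(addr);  /* or    %o5,%m44(data),%o4 */
    uint64_t o3 = o4 << 12;        /* sllx  %o4,12,%o3         */
    return o3 + l44(addr);         /* add   %o3,%l44(data),%o0 */
}
```

Any address that fits in 44 bits comes back unchanged, while bits above bit 43 are dropped. That is the constraint abs44 places on statically encoded addresses.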

Monday, April 2, 2012

Efficient inline templates and C++

I've talked before about calling inline templates from C++, and I've also talked about calling inline templates efficiently. This time I want to talk about efficiently calling inline templates from C++.

The obvious starting point is that I need to declare the inline templates as being extern "C":

  extern "C"
  {
    int mytemplate(int);
  }

This enables us to call it, but the call may not be very efficient because the compiler will treat it as a function call, and may produce suboptimal code based on that premise. So we need to add the no_side_effect pragma:

  extern "C"
  {
    int mytemplate(int); 
    #pragma no_side_effect(mytemplate)
  }

However, this may still not produce optimal code. We've discussed how the no_side_effect pragma cannot be combined with exceptions. We know that the code cannot produce exceptions, but the compiler doesn't know that. If we tell the compiler that information it may be able to produce even better code. We can do this by adding the "throw()" exception specification to the template declaration:

  extern "C"
  {
    int mytemplate(int) throw(); 
    #pragma no_side_effect(mytemplate)
  }

The following is an example of how these changes might improve performance. We can take our previous example code and migrate it to C++, adding the use of a try...catch construct:

#include <iostream>

extern "C"
{
  int lzd(int);
  #pragma no_side_effect(lzd)
}

int a;
int c=0;

class myclass
{
  int routine();
};

int myclass::routine()
{
  try
  {
    for(a=0; a<1000; a++)
    {
      c=lzd(c);
    }
  }
  catch(...)
  {
    std::cout << "Something happened" << std::endl;
  }
 return 0;
}

Compiling this produces a slightly suboptimal code sequence in the hot loop:

$ CC -O -xtarget=T4 -S t.cpp t.il
...
/* 0x0014         23 */         lzd     %o0,%o0
/* 0x0018         21 */         add     %l6,1,%l6
/* 0x001c            */         cmp     %l6,1000
/* 0x0020            */         bl,pt   %icc,.L77000033
/* 0x0024         23 */         st      %o0,[%l7]

There's a store in the delay slot of the branch, so we're repeatedly storing data back to memory. If we change the function declaration to include "throw()", we get better code:

$ CC -O -xtarget=T4 -S t.cpp t.il
...
/* 0x0014         21 */         add     %i1,1,%i1
/* 0x0018         23 */         lzd     %o0,%o0
/* 0x001c         21 */         cmp     %i1,999
/* 0x0020            */         ble,pt  %icc,.L77000019
/* 0x0024            */         nop

The store has gone, but the code is still suboptimal - there's a nop in the delay slot rather than useful work. However, it's good enough for this example. The point I'm making is that the compiler produces the better code with both the "throw()" and the no side effect pragma.

Pragmas and exceptions

The compiler pragmas:

  #pragma no_side_effect(routinename)
  #pragma does_not_write_global_data(routinename)
  #pragma does_not_read_global_data(routinename)

are used to tell the compiler more about the routine being called, and enable it to do a better job of optimising around the routine. If a routine does not read global data, then global data does not need to be stored to memory before the call to the routine. If the routine does not write global data, then global data does not need to be reloaded after the call. The no side effect directive indicates that the routine does no I/O, does not read or write global data, and the result only depends on the input.
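As an illustration of which routines would qualify, here are three small functions in plain C with the matching pragma noted in a comment. The function names and the global are made up for this sketch, and the pragmas are left in comments so the code builds with any compiler.

```c
/* Global state that a called routine might touch. */
int call_count = 0;

/*
 * No I/O, no global reads or writes, result depends only on the input.
 * Candidate for: #pragma no_side_effect(add_one)
 */
int add_one(int x)
{
    return x + 1;
}

/*
 * Reads a global but never writes one.
 * Candidate for: #pragma does_not_write_global_data(add_count)
 */
int add_count(int x)
{
    return x + call_count;
}

/*
 * Reads and writes a global, so none of the three pragmas would be
 * safe to apply here.
 */
int add_one_counted(int x)
{
    call_count++;
    return x + 1;
}
```

Around a call to add_one() the compiler could keep every global in a register. Around add_count() it would still have to store globals before the call, but would not need to reload them afterwards.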

However, these pragmas should not be used on routines that throw exceptions. The following example indicates the problem:

#include <iostream>

extern "C"
{
  int exceptional(int);
  #pragma no_side_effect(exceptional)
}

int exceptional(int a)
{
  if (a==7)
  {
    throw 7;
  }
  else
  {
   return a+1;
  } 
}


int a;
int c=0;

class myclass
{
  public:
  int routine();
};

int myclass::routine()
{
  for(a=0; a<1000; a++)
  {
    c=exceptional(c);
  }
 return 0;
}

int main()
{
  myclass f;
  try
  {
    f.routine();
  }
  catch(...)
  {
    std::cout << "Something happened" << a << c << std::endl;
  }
  
}

The routine "exceptional" is declared as having no side effects, however it can throw an exception. The no side effects directive enables the compiler to avoid storing global data back to memory, and retrieving it after the function call, so the loop containing the call to exceptional is quite tight:

$ CC -O -S test.cpp
...
                        .L77000061:
/* 0x0014         38 */         call    exceptional     ! params =  %o0 ! Result =  %o0
/* 0x0018         36 */         add     %i1,1,%i1
/* 0x001c            */         cmp     %i1,999
/* 0x0020            */         ble,pt  %icc,.L77000061
/* 0x0024            */         nop

However, when the program is run the result is incorrect:

$ CC -O t.cpp
$ ./a.out
Something happened00

If the code had worked correctly, the output would have been "Something happened77" - the exception is thrown on the call made when both 'a' and 'c' equal 7. Yet the current code produces a message that uses the original values for the variables 'a' and 'c'.

The problem is that the exception handler reads global data, and due to the no side effects directive the compiler has not updated the global data before the function call. So these pragmas should not be used on routines that have the potential to throw exceptions.

Friday, March 30, 2012

Inline template efficiency

I like inline templates, and use them quite extensively. Whenever I write code with them I'm always careful to check the disassembly to see that the resulting output is efficient. Here's a potential cause of inefficiency.

Suppose we want to use the mis-named Leading Zero Detect (LZD) instruction on T4 (this instruction does a count of the number of leading zero bits in an integer register - so it should really be called leading zero count). So we put together an inline template called lzd.il looking like:

.inline lzd
  lzd %o0,%o0
.end
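For reference, the operation the instruction performs can be sketched in portable C. This assumes the count is taken over a full 64-bit register, and the function name is invented here. It is a slow bit-by-bit loop meant to show the semantics, not a replacement for the instruction.

```c
#include <stdint.h>

/*
 * Count the zero bits above the most significant set bit of a 64-bit
 * value. Returns 64 when no bit is set.
 */
int leading_zero_count(uint64_t x)
{
    int count = 0;
    uint64_t mask = (uint64_t)1 << 63; /* start at the top bit */

    while (mask != 0 && (x & mask) == 0) {
        count++;
        mask >>= 1;
    }
    return count;
}
```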

And we throw together some code that uses it:

int lzd(int);

int a;
int c=0;

int main()
{
  for(a=0; a<1000; a++)
  {
    c=lzd(c);
  }
  return 0;
}

We compile the code with some amount of optimisation, and look at the resulting code:

$ cc -O -xtarget=T4 -S lzd.c lzd.il
$ more lzd.s
                        .L77000018:
/* 0x001c         11 */         lzd     %o0,%o0
/* 0x0020          9 */         ld      [%i1],%i3
/* 0x0024         11 */         st      %o0,[%i2]
/* 0x0028          9 */         add     %i3,1,%i0
/* 0x002c            */         cmp     %i0,999
/* 0x0030            */         ble,pt  %icc,.L77000018
/* 0x0034            */         st      %i0,[%i1]

What is surprising is that we're seeing a number of loads and stores in the code. Everything could be held in registers, so why is this happening?

The problem is that the code is only inlined at the code generation stage - when the actual instructions are generated. Earlier compiler phases see a function call. The called functions can do all kinds of nastiness to global variables (like 'a' in this code) so we need to load them from memory after the function call, and store them to memory before the function call.

Fortunately we can use a #pragma directive to tell the compiler that the routine lzd() has no side effects - meaning that it does not read or write to memory. The directive to do that is #pragma no_side_effect(<routine name>), and it needs to be placed after the declaration of the function. The new code looks like:

int lzd(int);
#pragma no_side_effect(lzd)

int a;
int c=0;

int main()
{
  for(a=0; a<1000; a++)
  {
    c=lzd(c);
  }
  return 0;
}

Now the loop looks much neater:

/* 0x0014         10 */         add     %i1,1,%i1

!   11                !  {
!   12                !    c=lzd(c);

/* 0x0018         12 */         lzd     %o0,%o0
/* 0x001c         10 */         cmp     %i1,999
/* 0x0020            */         ble,pt  %icc,.L77000018
/* 0x0024            */         nop

Wednesday, February 8, 2012

POSIX threading API on Windows

When I was writing the Multicore book, I was struck by the commonality of the Windows [1][2] and POSIX threading APIs. The concepts were common, but the names (and parameters) of the function calls were different. It looked like a relatively trivial task to write a shim to translate between the two. However, I had a book to write, so it languished until one evening last week when I actually got around to doing it.

The result is these two files winpthread.h and winpthread.cpp. I don't claim that these are a high fidelity port, there's plenty missing, but they should be sufficient to get a code working. In fact, there's very little complexity in the code. The only caveat is that condition variables were not introduced in Windows until Vista, and these require the use of critical sections rather than mutexes. The difference between mutexes and critical sections is that critical sections are local to a single process whereas mutexes can be shared between processes.

Let me know if there are bugs, any improvements, or if the code is useful.

Friday, February 3, 2012

Using prtpicl to get cache sizes

If you are on a SPARC system you can get cache size information using the command fpversion, which is provided with Studio:

$ fpversion
 A SPARC-based CPU is available.
 Kernel says main memory's clock rate is 1012.0 MHz.

 Sun-4 floating-point controller version 0 found.
 An UltraSPARC chip is available.

 Use "-xtarget=sparc64vii -xcache=64/64/2:5120/256/10" code-generation option.

The cache parameters are output exactly as you would want to pass them into the compiler - for each cache it describes the size in KB, the line size in bytes, and the associativity.

fpversion doesn't exist on x86 systems. The next best thing is to use prtpicl to output system configuration information, and inspect that output for cache size. Here's the cache output for the same SPARC system using prtpicl.

$ prtpicl -v |grep cache
              :l1-icache-size    0x10000
              :l1-icache-line-size       0x40
              :l1-icache-associativity   0x2
              :l1-dcache-size    0x10000
              :l1-dcache-line-size       0x40
              :l1-dcache-associativity   0x2
              :l2-cache-size     0x500000
              :l2-cache-line-size        0x100
              :l2-cache-associativity    0xa
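The prtpicl values are in hex and the sizes are in bytes, so a little conversion is needed to get them into the form that fpversion prints. A minimal sketch, assuming a two-level hierarchy and using the data cache figures for level one; the struct and function names are hypothetical.

```c
#include <stdio.h>

typedef struct {
    unsigned long size_bytes;    /* e.g. l1-dcache-size */
    unsigned long line_bytes;    /* e.g. l1-dcache-line-size */
    unsigned long associativity; /* e.g. l1-dcache-associativity */
} cache_level;

/*
 * Format two cache levels as an -xcache option: size in KB, line size
 * in bytes, and associativity, with levels separated by a colon.
 * Returns the value from snprintf().
 */
int format_xcache(char *buf, size_t len,
                  const cache_level *l1, const cache_level *l2)
{
    return snprintf(buf, len, "-xcache=%lu/%lu/%lu:%lu/%lu/%lu",
                    l1->size_bytes / 1024, l1->line_bytes, l1->associativity,
                    l2->size_bytes / 1024, l2->line_bytes, l2->associativity);
}
```

Passing in the values listed above (0x10000, 0x40, 0x2 and 0x500000, 0x100, 0xa) reproduces the -xcache=64/64/2:5120/256/10 string that fpversion reported.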

Tuesday, January 17, 2012

Separation of debug and executable

To reduce the size of shipped binaries it can be useful to separate the debug information into a separate file. This procedure is covered in the dbx manual. We can use objcopy to extract the debug information and then to link the executable with the extracted data.

Here's a short example executable:

#include <stdio.h>
#include <math.h>

int main()
{
  double d=1.0;
  d = sin(d);
  printf("sin(1.0) = %f\n",d);
}

Compiled with debug:

$ cc -g hello.c -lm
$ ./a.out
sin(1.0) = 0.841471

We can debug this executable with dbx. Note that, in this case, we compiled without optimisation in order to get the best debug information. Doing this does potentially sacrifice some performance. We can follow the same procedure with optimised code.

$ dbx ./a.out
Reading ld.so.1
Reading libm.so.2
Reading libc.so.1
(dbx) stop in main
(2) stop in main
(dbx) run
Running: a.out
(process id 53296)
stopped in main at line 6 in file "hello.c"
    6     double d=1.0;
(dbx) step
stopped in main at line 7 in file "hello.c"
    7     d = sin(d);
(dbx) print d
d = 1.0
(dbx) cont
Reading libc_psr.so.1
sin(1.0) = 0.841471

First of all we are going to use objcopy to extract the debug information from ./a.out and place it into ./a.out.debug:

$ /usr/sfw/bin/gobjcopy --only-keep-debug ./a.out ./a.out.debug

Now we can strip a.out of debug information:

$ strip ./a.out

To prove that this has removed the debug information we can try running under dbx:

$ dbx  ./a.out
Reading ld.so.1
Reading libm.so.2
Reading libc.so.1
(dbx) stop in main
dbx: warning: 'main' has no debugger info -- will trigger on first instruction
(2) stop in main
(dbx) quit

Now we want to use objcopy to make a link between the executable and its debug information:

$ /usr/sfw/bin/gobjcopy --add-gnu-debuglink=./a.out.debug ./a.out

Now when we debug the executable we are back to full debug:

$ dbx ./a.out
Reading ld.so.1
Reading libm.so.2
Reading libc.so.1
(dbx) stop  in main
(2) stop in main
(dbx) run
Running: a.out
(process id 58837)
stopped in main at line 6 in file "hello.c"
    6     double d=1.0;
(dbx) next
stopped in main at line 7 in file "hello.c"
    7     d = sin(d);
(dbx) print d
d = 1.0
(dbx) cont
Reading libc_psr.so.1
sin(1.0) = 0.841471

execution completed, exit code is 0
(dbx) quit

Friday, January 13, 2012

C++ and inline templates

A while back I wrote an article on using inline templates. It's a bit of a niche article as I would generally advise people to write in C/C++, and tune the compiler flags and source code until the compiler generates the code that they want to see.

However, one thing that the article implies but does not state is that inline templates are defined as C functions. When used from C++ they need to be declared as extern "C", otherwise you get linker errors. Here's an example template:

.inline nothing
  nop
.end

And here's some code that calls it:

void nothing();

int main()
{
  nothing();
}

The code works when compiled as C, but not as C++:

$ cc i.c i.il
$ ./a.out
$ CC i.c i.il
Undefined                       first referenced
 symbol                             in file
void nothing()                   i.o
ld: fatal: Symbol referencing errors. No output written to a.out

To fix this, and make the code compilable with both C and C++, we use the __cplusplus feature test macro and conditionally include extern "C". Here's the modified source:

#ifdef __cplusplus
  extern "C"
  {
#endif
    void nothing();
#ifdef __cplusplus
  }
#endif

int main()
{
  nothing();
}

Thursday, January 12, 2012

Please mind the gap

I find the timeline view in the Performance Analyzer incredibly useful, but I've often been puzzled by what causes the gaps - like those in the example below:

Timeline view

One of my colleagues pointed out that it is possible to figure out what is causing the gaps: the call stack responsible is the one shown by the event immediately after the gap. This makes sense. The Performance Analyzer works by sending a profiling signal to the thread multiple times a second. If the thread is not scheduled on the CPU then it doesn't get a signal. The first thing that the thread does when it is put back onto the CPU is to respond to those signals that it missed. Here's some example code so that you can try it out.

#include <stdio.h>

void write_file()
{
  char block[8192];
  FILE * file = fopen("./text.txt", "w");
  for (int i=0;i<1024; i++)
  {
    fwrite(block, sizeof(block), 1, file);
  }
  fclose(file);
}

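void read_file()
{
  char block[8192];
  /* "r+" opens an existing file for both reading and writing */
  FILE * file = fopen("./text.txt", "r+");
  for (int i=0;i<1024; i++)
  {
    fread(block,sizeof(block),1,file);
    /* step back over the block just read; cast before negating because sizeof is unsigned */
    fseek(file,-(long)sizeof(block),SEEK_CUR);
    fwrite(block, sizeof(block), 1, file);
    /* a positioning call is required before switching back to reading */
    fseek(file,0,SEEK_CUR);
  }
  fclose(file);
}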

int main()
{
  for (int i=0; i<100; i++)
  {
    write_file();
    read_file();
  }
}

This is the code that generated the timeline shown above, so you know that the profile will have some gaps in it. If we select the event after the gap we can see that the gaps are caused by the application either opening or closing the file.

_close

But that is not all that is going on. If we look at the Duration of the event in the Timeline details panel, we can see that it spent 210ms in the "Other Wait" microstate. So we've now got a pretty clear idea of where the time is coming from.
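
One way to sanity-check a gap like this without the Analyzer is to compare elapsed time with on-CPU time around the suspect calls. The difference approximates the time the thread spent off the CPU. The sketch below assumes a POSIX system that provides clock_gettime() with CLOCK_MONOTONIC and CLOCK_PROCESS_CPUTIME_ID. The names time_file_write and to_seconds are hypothetical, and it writes to a tmpfile() rather than to ./text.txt.

```c
#define _POSIX_C_SOURCE 200809L  /* request the clock_gettime() declaration */
#include <assert.h>
#include <stdio.h>
#include <time.h>

/* Hypothetical helper: seconds held in a timespec. */
static double to_seconds(struct timespec t)
{
  return (double)t.tv_sec + (double)t.tv_nsec / 1e9;
}

/* Hypothetical helper: write 'blocks' 8KB blocks to a temporary file and
   report the elapsed (wall-clock) and on-CPU time for the whole operation,
   including the close. Returns 0 on success, -1 if no file could be opened. */
int time_file_write(int blocks, double *wall, double *cpu)
{
  char block[8192] = {0};
  struct timespec w0, w1, c0, c1;
  FILE *file;

  assert(wall != NULL && cpu != NULL);
  file = tmpfile();  /* avoids having to choose a file name */
  if (file == NULL) return -1;

  clock_gettime(CLOCK_MONOTONIC, &w0);
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c0);
  for (int i = 0; i < blocks; i++)
  {
    fwrite(block, sizeof(block), 1, file);
  }
  fclose(file);
  clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &c1);
  clock_gettime(CLOCK_MONOTONIC, &w1);

  *wall = to_seconds(w1) - to_seconds(w0);
  *cpu  = to_seconds(c1) - to_seconds(c0);
  return 0;
}
```

If the elapsed time is much larger than the CPU time, the process was waiting rather than running, which is the situation that shows up as a gap in the timeline. On older systems clock_gettime() may need -lrt at link time.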

Monday, January 9, 2012

A static function, an inline function, and a static variable walked into a bar....

... well, not really. I was hacking around with some library code, so I thought I'd write up a quick refresher on scoping. Steve Clamage and I cover scoping in more detail in the series on libraries and linking. For the code I was working on today, the problem was much more limited.

I had a single file containing all the source code. I wanted to export only the minimal number of symbols that were needed to act as an interface for the library. You can imagine it being something like:

#include <stdio.h>

int count=0;

inline void printcount()
{
  printf("Count = %i\n",count);
  asm("nop");
}

void next()
{
  count++;
  printcount();
}

If I compile this, and then use nm to inspect the resulting library, I can see a global symbol for count. The function printcount() is defined with local scope. However, the only interface I want to export is next().

bash-3.00$ cc -g -G -O -o libt.so t.c
bash-3.00$ nm libt.so|grep GLOB
...
[45]    |     66468|       4|OBJT |GLOB |0    |11     |count
[43]    |       724|      40|FUNC |GLOB |0    |5      |next
[42]    |         0|       0|FUNC |GLOB |0    |UNDEF  |printf
bash-3.00$ nm libt.so |grep count
[44]    |     66460|       4|OBJT |GLOB |0    |11     |count
[32]    |       672|      52|FUNC |LOCL |0    |5      |printcount

So I can define count as a static variable, and that reduces its scope to the file in which it is defined. However, this does not actually make it disappear: it is still there, and still global, but with a mangled name:

bash-3.00$ nm libt.so|grep count
[40]    |     66476|       4|OBJT |GLOB |0    |11     |$XAS4IkBuA_CPGtc.count
[33]    |       688|      52|FUNC |LOCL |0    |5      |printcount

The reason for this is that I'm building with debug (-g). With debug, I get a local version of the routine printcount(), and I get a globalised version of the variable count. If I remove -g, I get the following output from nm:

bash-3.00$ nm libt.so|grep count
[29]    |     66316|       4|OBJT |LOCL |0    |11     |count
[36]    |         0|       0|FUNC |GLOB |0    |UNDEF  |printcount

The variable count has local scope, which is what we expected - it is no longer exported from the file, so we have avoided possible name conflicts there. However, printcount() is now no longer defined. That might be ok so long as we don't actually call the routine:

bash-3.00$ dis libt.so|grep printcount
printcount()
         2e4:  7f ff ff ef  call        printcount      ! 0x2a0

Oops. We've hit the rule about needing to provide an extern version of any inline functions. Once again, I suggest reading Douglas Walls' discussion of the topic for the gory details. Anyhow, the upshot is that this library wouldn't work. The fix is trivial: declare printcount() to be static inline, and the compiler will generate the local version of the function:

bash-3.00$ cc -G -O -o libt.so t.c
bash-3.00$ nm libt.so |grep count
[29]    |     66448|       4|OBJT |LOCL |0    |11     |count
[30]    |       664|      52|FUNC |LOCL |0    |5      |printcount

With these fixes the library no longer exports any symbols other than the ones I left with external linkage. This substantially reduces the risk of "undefined behaviour".
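
Putting both changes together, the source ends up looking roughly like the sketch below. I've left out the asm("nop") line and added void parameter lists to keep it portable across compilers. Otherwise it follows the earlier listing.

```c
#include <stdio.h>

/* File scope only: not exported from the library. */
static int count = 0;

/* static inline: the compiler emits a local copy if it does not inline. */
static inline void printcount(void)
{
  printf("Count = %i\n", count);
}

/* The only function intended to be exported. */
void next(void)
{
  count++;
  printcount();
}
```

Built without -g, as in the last nm listing, I would expect count and printcount to appear only as LOCL symbols (if at all) and next to be the one GLOB function the library defines.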

Understanding binary size

One of my colleagues, Miriam Blatt, has written a great article about understanding the size of binary objects. This is worth a read because it describes both what goes into the objects and what tools you can use to discover this information.

What's inlined by -xlibmil

The compiler flag -xlibmil provides inline templates for some critical maths functions, but it comes with the optimisation that it does not set errno for these functions. The functions it inlines can vary from release to release, so it's useful to be able to see which functions are inlined, and determine whether you care that they don't set errno. You can see the list of functions using the command:

$ grep inline /compilerpath/prod/lib/libm.il
        .inline sqrtf,1
        .inline sqrt,2
        .inline ceil,2
        .inline ceilf,1
        .inline floor,2
        .inline floorf,1
        .inline rint,2
        .inline rintf,1
...

From a cursory glance at the list I got when I did this just now, I can only see sqrt as a function that sets errno. So if you use sqrt and you care about whether it sets errno, then don't use -xlibmil.
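
If you want to check whether a particular call still sets errno in your build, a small probe is enough. This is only a sketch: ERRNO_AFTER is a hypothetical macro of my own, and the two stand-in routines exist purely so that the example can be tried without linking libm.

```c
#include <errno.h>

/* Hypothetical macro: evaluate expr and yield the errno value it left
   behind (0 if errno was not set). The call is written directly at the
   point of use, so any inlining the compiler applies still takes place. */
#define ERRNO_AFTER(expr) (errno = 0, (void)(expr), errno)

/* Stand-in routines so the sketch can be tried without libm
   (hypothetical names). */
double sets_edom(double x)    { errno = EDOM; return x; }
double leaves_errno(double x) { return x; }
```

In a real test you would include <math.h>, print the value of ERRNO_AFTER(sqrt(-1.0)), and build once with and once without -xlibmil. A non-zero value (I'd expect EDOM) means errno is being set, and zero means it is not. The macro form matters here: calling sqrt through a function pointer would reach the library routine rather than the inline template.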