SPO600

Wednesday, 16 April 2014

The Last Ditch Effort, Pt. 2

To my great dismay, I'm finding myself riddled with more roadblocks to a successful build and benchmark on the aarch64 port of Gnash as the semester comes to a close:

Word has reached me from the community regarding some of the inline assembly in "jemalloc.c". Apparently, as I've been researching, they integrated an updated version of the memory allocation algorithm into Gnash that already includes aarch64 implementation logic. Their github repository can be found and cloned using this link. The "pause" instructions for the cpu_spinwait loop logic blocks remain the same, but were moved to the configure.ac file, while a new memory allocation value for aarch64 was added as a pre-processor directive block in "jemalloc_internal.h.in" located in the include/jemalloc/internal directory:

# ifdef __aarch64__

# define LG_QUANTUM 4

# endif

From reading the API's documentation, it simply serves to appropriately declare the minimum or ideally efficient amount of bytes for memory allocation use cases and is utilized in conjunction with other tools such as Valgrind for memory leak detection, memory allocation debugging, and profiling. The aforementioned block of code is one of many for various architectures. Below is a quoted paragraph from the implementation notes of the API that seem most relevant to the logic behind this code block:

"Traditionally, allocators have used sbrk(2) to obtain memory, which is suboptimal for several reasons, including race conditions, increased fragmentation, and artificial limitations on maximum usable memory. If --enable-dss is specified during configuration, this allocator uses both mmap(2)and sbrk(2), in that order of preference; otherwise only mmap(2) is used.
This allocator uses multiple arenas in order to reduce lock contention for threaded programs on multi-processor systems. This works well with regard to threading scalability, but incurs some costs. There is a small fixed per-arena overhead, and additionally, arenas manage memory completely independently of each other, which means a small fixed increase in overall memory fragmentation. These overheads are not generally an issue, given the number of arenas normally used. Note that using substantially more arenas than the default is not likely to improve performance, mainly due to reduced cache performance. However, it may make sense to reduce the number of arenas if an application does not make much use of the allocation functions.
In addition to multiple arenas, unless --disable-tcache is specified during configuration, this allocator supports thread-specific caching for small and large objects, in order to make it possible to completely avoid synchronization for most allocation requests. Such caching allows very fast allocation in the common case, but it increases memory usage and fragmentation, since a bounded number of objects can remain allocated in each thread cache.
Memory is conceptually broken into equal-sized chunks, where the chunk size is a power of two that is greater than the page size. Chunks are always aligned to multiples of the chunk size. This alignment makes it possible to find metadata for user objects very quickly.
User objects are broken into three categories according to size: small, large, and huge. Small objects are smaller than one page. Large objects are smaller than the chunk size. Huge objects are a multiple of the chunk size. Small and large objects are managed by arenas; huge objects are managed separately in a single data structure that is shared by all threads. Huge objects are used by applications infrequently enough that this single data structure is not a scalability issue.
Each chunk that is managed by an arena tracks its contents as runs of contiguous pages (unused, backing a set of small objects, or backing one large object). The combination of chunk alignment and chunk page maps makes it possible to determine all metadata regarding small and large allocations in constant time.
Small objects are managed in groups by page runs. Each run maintains a frontier and free list to track which regions are in use. Allocation requests that are no more than half the quantum (8 or 16, depending on architecture) are rounded up to the nearest power of two that is at least sizeof(double). All other small object size classes are multiples of the quantum, spaced such that internal fragmentation is limited to approximately 25% for all but the smallest size classes. Allocation requests that are larger than the maximum small size class, but small enough to fit in an arena-managed chunk (see the"opt.lg_chunk" option), are rounded up to the nearest run size. Allocation requests that are too large to fit in an arena-managed chunk are rounded up to the nearest multiple of the chunk size.
Allocations are packed tightly together, which can be an issue for multi-threaded applications. If you need to assure that allocations do not suffer from cacheline sharing, round your allocation requests up to the nearest multiple of the cacheline size, or specify cacheline alignment when allocating.
Assuming 4 MiB chunks, 4 KiB pages, and a 16-byte quantum on a 64-bit system, the size classes in each category are as shown in Table 1."

Table 1. Size classes

Category	Spacing	Size
Small	lg	[8]
	16	[16, 32, 48, ..., 128]
	32	[160, 192, 224, 256]
	64	[320, 384, 448, 512]
	128	[640, 768, 896, 1024]
	256	[1280, 1536, 1792, 2048]
	512	[2560, 3072, 3584]
Large	4 KiB	[4 KiB, 8 KiB, 12 KiB, ..., 4072 KiB]
Huge	4 MiB	[4 MiB, 8 MiB, 12 MiB, ...]

========================================================================

I have yet to hear from the community in relation to my further inquiry about the minimalist gnu for windows asm block in "utility.h" and whether or not there is a way or a necessity to port this operation into aarch64. The code block can be seen below:

#ifndef __MINGW32__

#undef assert

#define assert(x) if (!x)) { __asm { int 3 } }

#endif

From my own research, there is a way to build MingW on cross-platforms for whatever required processor, via these how-to instructions that effectively tell you to download the required libraries and change the specific target variables to your particular processor/architecture. Either way, the arm instruction equivalent that I would have tested would have been a "BRK 0" instruction replacing the "INT 3" Intel x64 call for a debugging breakpoint, as mentioned in my earlier march summary recap post. Unfortunately, I cannot test any of this proposed logic due to the dependency issues in the build, discussed below.

My biggest trouble has been trying to simply run the configure and make files on the qemu environment. The purported development version of Gnash is currently 0.8.11, yet in any of the channels that I've tried to download the source code from, including the repository clone from github, the repository from fedpkg, and a few Canadian FTP Mirrors for all GNU-related packages, everything stops at 0.8.10 even though 0.8.11's changelog is already posted up on Gnash's Wiki page.

Regardless, after running the ./configure command with the added parameter of changing the build to aarch64 in the package's present state ("./configure --build=aarch64-unknown-linux"), I cannot install one crucial missing dependency on the qemu environment according to the resulting output - that being xulrunner-devel. This is a mozilla-related package that according to this Bugzilla discussion has still yet to be upstreamed in an aarch64-compatible version as of the 10th of April. This is further proven with the lack of any matches in relevant yum repositories of the package when trying to yum install or run a yum search all query.

In conclusion, with the added scheduling and time constraints put on by the 14-week curriculum of the semester structure and the workload of other classes, I was not able to achieve the desired and ideal amount of work in order to achieve the optimal amount of progress for a package with the amount of obstacles I have come across such as Gnash. My hope is that my research and work herein will serve as a potentially useful tool for the rest of the community going forward in the effort to successfully build, run, and optimize this package for 64-bit ARM implementations. I will upload a public github repository of my environment in its current state and post a link to it on this blog at the community's behest.

Saturday, 12 April 2014

The Last Ditch Effort, Pt. 1

After presenting the current state of my work on the Gnash package, I am currently back to solving the build dependency debacle before inserting the assembly translations into the package code. "yum-builddep gnash.spec" immediately proved unsuccessful, so after a few attempts at troubleshooting via google I went ahead and used the piped sed command combination generously given to me by Chris Tyler:

"yum install -y $(cat *spec | sed -n "s/^BuildRequires://p" | sed "s/>=.*//")"

Effectively, this stream editing command parsed through the gnash.spec file and searched for any and all build dependency packages written after the "BuildRequires:" string using the regular expressions embedded within, and finally redirected the output to the appropriate portion of the yum-install command.

As far as my research of the found assembly code, the code in the affected files - "jemalloc.c" and "utility.h" wrap the pre-processor blocks in the constraints "#ifdef __i386__", "#ifdef __amd64__", and "#ifndef __MINGW32__" respectively. Unfortunately, due to google being of very little help in such niche specializations, I am currently awaiting word from the community if any of this logic needs to be reapplied for aarch64 - especially the memory allocation logic of the blocks in jemalloc.c, for example:

#ifdef __arm__

# define QUANTUM_2POW_MIN 4

# define SIZEOF_PTR_2POW 3

# define NO_TLS

#endif

More to come in the next few days.

Tuesday, 1 April 2014

Comprehensive March Recap

As expected, the latter cadence of the current semester has been nothing short of chaotic with the general relegation of project work and test preparation, so an apology is due on behalf of yours truly for not consistently updating on the progress of the software package port. The progress below will be chronologically (to the best of my ability) subdivided into relevant sections in order to best reference any and all steps, obstacles, and solutions that have been encountered over the past month:

Acquisition of the package -

Shortly after the codebase analysis lab (found here) covering Unix commands, file extensions, and directories that were and are useful in the importation of software packages and the search for assembly code and its implications, a list of software packages and libraries carried off of Linaro's Linux on ARM64 porting project was posted on the course wiki in order to give students a reasonable amount of time to decide on an appropriate package that satisfies the prescribed requirements (http://zenit.senecac.on.ca/wiki/index.php/Winter_2014_SPO600_Software_List). After careful consideration that includes downloading the package, analyzing the scope of the work involved in order to successfully be able to either port or eliminate any assembly-related dependencies to the new architecture within the time constraints of the semester, my sights were set on the GNU foundation's open source flash decoder package - Gnash.

About Gnash -

The GNU flash player. An open source, web-based video and audio decoder dealing mostly with .swf files, up to version 9. The software is based on a library called gameswf. Any other details regarding the software's overview are found on its homepage. The developement community is rather scarce at this point, with one individual on their IRC channel pointing me to the mailing list as being the last bastion of community discussion in relation to bugs or patches. After having subscribed to it for at least a month, there has been no ongoing activity whatsoever, unfortunately.

Local Environment Setup -

Given the options provided by Chris Tyler for analysis, implementation, and benchmarking, as well as the constraints of running Windows 8 natively on my local machine before embarking on this course, my initial task was to install Fedora 20 on a virtual machine interface. Using Oracle's Virtualbox software, a quick Google search ended at a very useful Wikihow Article on an efficient way to install fedora on windows using the Oracle VM.

Installing the 64-bit ARM emulator available to us on the Ireland remote server on the local machine seemed like the next ideal task to accomplish and was achieved with relative ease using the SPO600 Wiki's instructions at the bottom of the Qemu overview page. After a slight misunderstanding of where to properly transfer the "qemu-arm64.conf" file to the local machine (the etc/binfmt.d directory path relative to root and not the newly created arm64 directory used for the emulator) and some fumbling around with the tar command options for unpacking compressed version of the environment, it was mostly straight-forward.

Lastly, downloading and installing the Fedora AArch64 Foundation Model was a slightly odd experience, with the biggest stumble occuring when trying to download the files from the arm64 website and not seeing the actual "download" link appear on the bottom of the page. This was promptly solved by reverting to the Windows machine to download the needed file and e-mail it to myself back on the virtual machine.

The x86_64 Build and First Bugfix -

After peering, perusing around, and finding the assembly code necessary to begin the research and work involved into directly translating and compiling the package on Qemu, the first attempt at building Gnash on x86 proved to be unsuccessful after a 30 minute delay. The culprit was determined to be a nullptr exception variant in two cases in the npruntime.h file. Simply opening the file and editing the code manually proved sufficient enough for the build to be successful on the 2nd run through.

Ongoing aarch64 Porting Implementation -

Using the recursive grep command on patterns containing "asm" "__asm" or "(asm", the output resulted in the following results:

gnash-0.8.10/libbase/jemalloc.c:# define CPU_SPINWAIT __asm__ volatile("pause")
gnash-0.8.10/libbase/jemalloc.c:# define CPU_SPINWAIT __asm__ volatile("pause")

gnash-0.8.10/libbase/utility.h:#define assert(x) if (!(x)) { __asm { int 3 } }

With professor Tyler's advice, as well as deductive research in the ARM ISO, I have managed to narrow down the aarch64 alternatives to be a BRK 0 instruction as a replacement for the "int 3" debugging breakpoint macro, and the "pause" command will be substituted for a "YIELD" command on the Qemu environment in relation to spin/wait loops.

I am using a few references to look further into these operations and their purpose:

Stack Overflow Query
INT (x86_instruction) - Wikipedia
ARMv8 Instruction Set Overview and Developer Manual

This is currently in the progress of being implemented and tested.

Miscellaneous Calamities -

A mistake of seemingly seismic proportion occurred on my part on March 27th. In the middle of a lecture, a notice to update the operating system popped up at the bottom of my screen. When an update process is ongoing, it effectively downloads the new dependencies first before deleting the old ones. Whether or not this is a symptom of default settings on the virtual machine or not needs to be looked into further, but nevertheless the update process was likely interrupted by the virtual machine powering off right in the middle of downloading and installing new versions of packaging and deleting the older versions akin to what was just installed, ultimately resulting in a GUI-related package called "gdm" that contained duplicate dependencies and other packages using obsolete versions which translate to the operating system being unable to properly load any of the user interface layers that aren't the terminal environment. The subsequent 4 days following that consisted of complete and utter mental stress, chaos, and anguish, only to be solved by (yet again) the technical expertise and prowess of Chris Tyler. The residing lesson here would be never to update your virtual machine's operating system without unassailable confidence that you know and have complete control over the processes that will be executing as well as their timings.

Monday, 27 January 2014

Initial Experimentation with Cross-Platform Assembler

In this particular exercise, there were two parts to be compiled and executed on two respective platforms - x86/64 and ARM64. Trivial sounding at first, but much more intricate and procedure intensive when labored through, the tasks were to initially create a 10 iteration loop that is outputted as a character string that shows the ascending integer of the iterator. This would be followed by expanding the loop to 30 iterations and presenting the incremented result in a "double digit" format that combines both the quotient and the remainder of the number when divided by 10.

Our group ran into a handful of obstacles throughout the process of completing both parts of the lab, and I have taken the liberty of jotting them down as we encountered them:

- Throughout the first part, we constantly fiddled around with the .data variable lengths and calculations in order to insert the result byte at the correct index of the prescribed string.

- A recurring error we ran into was a segmentation fault involving the index variable, with my first guess assuming that this was attributed to the declaration syntax. Later fixed with the inclusion of parentheses.

- Stepping through the execution with the GNU debugger (gdb) gave us a weird value at register 12 at first.This was corrected by changing the use to the right register (rdx instead rcx).

- The format came out wonky at first, the newline character did not get executed. The correction to this came with Nicholas's logic to include the calculations in the loop on where to store the byte in the string.

- In the Aarch64 port, the first main problem was trying to load the "index" variable to a register, since we couldn't find a replacement for the x86 equivalent of a -b suffix eg. "r13b".

- A segmentation fault followed yet again after finding "strb" which might or might not have worked for the problem above.

- We observed some interesting output when changing line 11 to "adr" from "mov": "qemu: Unsupported syscall: 0" in the 10 iterations of the loop.

- In part 2 for the x86 concerns (which were greatly reduced):

1) Hex values were for some odd reason confused with the ASCII conversion on the divisor.

- As for Aarch64:

1) I was especially not a fan of how the "msub" instruction was explained on the wiki and therefore had some trouble implementing it. Later deduced that the structure of the instruction is as follows:

msub "register to store calculated remainder in" "register where first numeral to be divided is stored" "register where second numeral to be divided is stored" "quotient (whole number result of division)"

Below is our completed code for Part 2 for x86/64, followed by ARM64:

/* x86 */

.text
.globl _start

_start:

mov $0, %r15
mov $10, %r14 /*used to store the divider */

loop:

mov %r15, %rax
mov $0, %rdx
div %r14

add $0x30, %al /*convert to ascii */
add $0x30, %dl

mov %al, (msg + len - 3) /* store the byte */
mov %dl, (msg + len - 2)

movq $len,%rdx /* message length */
movq $msg,%rsi /* message location */
movq $0x1,%rdi /* file descriptor stdout */
movq $0x1,%rax /* syscall sys_write */
syscall

inc %r15
cmp $0x1e, %r15
jne loop

movq $0,%rdi /* exit status */
movq $60,%rax /* syscall sys_exit */
syscall

.data

msg: .ascii "Loop: ##\n"
len = . - msg

=================================================================

/* aarch64 */

.text

.global _start

_start:

mov x19, 0
mov x23, 10 /* used for dividing */
adr x24, msg

loop:
mov x20, x19/* calculate the byte */
udiv x21, x20, x23 /* r21 = i / 10 */
msub x22, x21, x23, x20 /* r22 = i - (r21 * 10) ie gets the remainder*/

add x21, x21, 0x30 /* convert to ascii */
add x22, x22, 0x30

strb w21, [x24, len - 3]/* set the byte */
strb w22, [x24, len - 2]

mov x0, 1 /* print the loop */
adr x1, msg
mov x2, len
mov x8, 64
svc 0

add x19, x19, 1 /* if not 30 revert to the beginning of the loop */
cmp x19, 30
bne loop

mov x0, 0
mov x8, 93
svc 0

.data
msg: .ascii "Loop: ##\n"

len = . - msg

Not included here, but completed later is the suppression of the zero digits on the quotient in the first 10 iterations of the loop, which ended up being a simple case of comparing the quotient to a zero stored in a register and skipping the register containing the 0 before output depending on whether or not the condition is true. This is was done like so:

x86 - cmp $0x00, %rax
je skip_10s

aarch64 - cmp x21, 0x00
beq skip

Friday, 17 January 2014

Observing Compiler Operations with C Source Code

A very simple hello world style function was written in C and compiled using the GCC for the purpose of being lastly disassembled using objdump in order to view and summarize the compiler translation of the main function as an exercise in basic understanding of assembler logic.

Here is the original GCC command used to compile the helloworld.c file:

gcc -g -O0 -fno-builtin helloworld.c -o hello

The following are 6 cases of program and compiler option manipulations that were tested and compared for potentially varied output and behavior. My particular group covered the 4th case in this scenario, that being the addition of multiple arguments to the printf() function:

When more than 1 argument is passed into the “printf” function in c source code, the compiler essentially adds in more MOVL (move-load) assembler functions into aligned addresses in the CPU registry that are descending by 4 hex digits (also concurrent with the amount of bits needed to store the length of the variable). When inserting a much larger “long integer”, however, the compiler needed to add in another MOVL function line in order to accommodate the width of the numeral into 8 bits of memory accordingly. The length of the variable doesn’t seem to make a difference (between int and long int). The string was MOV’d to a different registry (eax as opposed to esp which is also a stack pointer) and was not labeled as a pointer but called directly. The last major execution is a function call to printf from the procedure linking table. Interestingly enough, it was pointed out that as opposed to the compiler simply “PUSH”ing the variables into the stack, it utilized the more complex MOVL function due to a yet to be determined reason by us…

Below is the first screenshot of the compiler code with 11 argument paramaters, the first being a string and then next ten being simple 4 bit integers in ascension -

Here is the discrepancy with the edit in the long 18-digit integer:

The rest of my fellow classmates' cases involved manipulations with the compiler options, listed below along with descriptions of the results observed thereafter:

1) Adding the compiler option -static

The general consensus was that the file size became noticeably larger, due to the nature of the option disabling the dynamic linking in the imported libraries and thereby adding them in their entirety, as opposed to them being shared. The only seeming advantage to this is the fact that the user doesn't need to pre-install any of the libraries being called.

2) Removing the -fno-builtin option from the compilation command

The file size decreases slightly and the compiler replaced the printf() function with the less taxing puts() function, thereby optimizing it.

3) Removing the -g option from the compilation command

This option ends up removing any debugging information and headers from the output file, and thereby lowers its memory size at the cost of omitting any and all useful data for tracking errors in the code.

5) Moving the printf() call to a separate function, and call that function from main()

The snippet of code I ended up using looked like this:

In the objdump output what essentially ended up happening is that in main the callq ended using <newfunc> and then skipped over to newfunc's header and called <printf@plt> from there. Simply put, it ended up being the assembly logic equivalent of what happened in the C code.

6) Removing -O0 and add -O3 to the gcc options

This effectively switches the variation and breadth of optimization that is allowed to the compiler. In this particular case, it has changed from "do as little optimization as possible" to "get into overdrive". As a result, I could assume that it comes down to being a question of compile time vs. execution time. Here is the main observed difference in the compiler output:

Using the O0 option -

Using the O3 option -

The main difference in this code is the main function's assembler logic optimizations.

Tuesday, 14 January 2014

A Comparison of Code Review Processes

Over the past few days, I have looked over a pair of software package communities online that are ideally releasing their intellectual property under to different open source licenses in order to juxtapose any distinct difference between popular code reviewing techniques used in today's world.

My first example would have to be the community at the OpenMW project who I've been religiously following for a few years now. They have been operating under the GNU GPL version 3 license. Code reviewing and bug tracking in this project is maintained and updated using a project management app known as Redmine. Unfortunately, they have no readily available documentation that explicitly outlines their standards of coding practices, despite the presence of the correlating tab on the site. The features that clearly shines through in its ease of access and readability is the issue tracker and the roadmap being displayed. The issue tracker clearly indicates the issue type, priority, subject, assignee, and dates created and updated. Once the issue is clicked on, many further details are displayed as well as any related progress and discussion surrounding it. A great example of this can be seen here.

This next software project is called SilverStripe CMS - an open source web management tool. Their software is licensed under the BSD License, which they argue is more flexible and less restrictive than the GPL in cases such as companies wanting to integrate SilverStripe code into their product without revealing any custom codes of their own. SilverStripe now uses the familiar Github revision control system for issue tracking. In close similarity with the previously shown Redmine app, it incorporates easily accessible tickets for features and bugs that can be described, discussed, and updated on the same webpage. It also works in conjunction with its git environment where any changes are "pushed" and "pulled" by the appropriate users into the repository as necessary to achieve a specified milestone. Documentation on using some of the introductory features of git and github can be seen here.

Both of the projects found exhibit great use of tools to communicate and attain their personal goals, with the major difference being the more organized and hierarchical nature of SilverStripe which I would surmise is due to the breadth and age of the project itself, with its community being leagues larger than OpenMW's.