Wednesday 16 April 2014

The Last Ditch Effort, Pt. 2

To my great dismay, I'm finding myself riddled with more roadblocks to a successful build and benchmark on the aarch64 port of Gnash as the semester comes to a close:

Word has reached me from the community regarding some of the inline assembly in "jemalloc.c". Apparently, as my own research has since confirmed, an updated version of the memory allocator that already includes aarch64 support has been integrated into Gnash. Their GitHub repository can be found and cloned using this link. The "pause" instructions for the CPU_SPINWAIT spin-loop logic remain the same, but were moved to the configure.ac file, while a new quantum definition for aarch64 was added as a pre-processor directive block in "jemalloc_internal.h.in", located in the include/jemalloc/internal directory:

# ifdef __aarch64__
#   define LG_QUANTUM       4
# endif

From reading the API's documentation, LG_QUANTUM is the base-2 logarithm of the quantum, the minimum allocation alignment in bytes, so a value of 4 gives aarch64 a 16-byte quantum. According to the same documentation, the allocator can also be used in conjunction with tools such as Valgrind for memory leak detection, memory allocation debugging, and profiling. The aforementioned block of code is one of many for various architectures. Below are the quoted paragraphs from the implementation notes of the API that seem most relevant to the logic behind this code block:

"Traditionally, allocators have used sbrk(2) to obtain memory, which is suboptimal for several reasons, including race conditions, increased fragmentation, and artificial limitations on maximum usable memory. If --enable-dss is specified during configuration, this allocator uses both mmap(2) and sbrk(2), in that order of preference; otherwise only mmap(2) is used.
This allocator uses multiple arenas in order to reduce lock contention for threaded programs on multi-processor systems. This works well with regard to threading scalability, but incurs some costs. There is a small fixed per-arena overhead, and additionally, arenas manage memory completely independently of each other, which means a small fixed increase in overall memory fragmentation. These overheads are not generally an issue, given the number of arenas normally used. Note that using substantially more arenas than the default is not likely to improve performance, mainly due to reduced cache performance. However, it may make sense to reduce the number of arenas if an application does not make much use of the allocation functions.
In addition to multiple arenas, unless --disable-tcache is specified during configuration, this allocator supports thread-specific caching for small and large objects, in order to make it possible to completely avoid synchronization for most allocation requests. Such caching allows very fast allocation in the common case, but it increases memory usage and fragmentation, since a bounded number of objects can remain allocated in each thread cache.
Memory is conceptually broken into equal-sized chunks, where the chunk size is a power of two that is greater than the page size. Chunks are always aligned to multiples of the chunk size. This alignment makes it possible to find metadata for user objects very quickly.
User objects are broken into three categories according to size: small, large, and huge. Small objects are smaller than one page. Large objects are smaller than the chunk size. Huge objects are a multiple of the chunk size. Small and large objects are managed by arenas; huge objects are managed separately in a single data structure that is shared by all threads. Huge objects are used by applications infrequently enough that this single data structure is not a scalability issue.
Each chunk that is managed by an arena tracks its contents as runs of contiguous pages (unused, backing a set of small objects, or backing one large object). The combination of chunk alignment and chunk page maps makes it possible to determine all metadata regarding small and large allocations in constant time.
Small objects are managed in groups by page runs. Each run maintains a frontier and free list to track which regions are in use. Allocation requests that are no more than half the quantum (8 or 16, depending on architecture) are rounded up to the nearest power of two that is at least sizeof(double). All other small object size classes are multiples of the quantum, spaced such that internal fragmentation is limited to approximately 25% for all but the smallest size classes. Allocation requests that are larger than the maximum small size class, but small enough to fit in an arena-managed chunk (see the "opt.lg_chunk" option), are rounded up to the nearest run size. Allocation requests that are too large to fit in an arena-managed chunk are rounded up to the nearest multiple of the chunk size.
Allocations are packed tightly together, which can be an issue for multi-threaded applications. If you need to assure that allocations do not suffer from cacheline sharing, round your allocation requests up to the nearest multiple of the cacheline size, or specify cacheline alignment when allocating.
Assuming 4 MiB chunks, 4 KiB pages, and a 16-byte quantum on a 64-bit system, the size classes in each category are as shown in Table 1."

Table 1. Size classes
Category   Spacing   Size
Small      lg        [8]
           16        [16, 32, 48, ..., 128]
           32        [160, 192, 224, 256]
           64        [320, 384, 448, 512]
           128       [640, 768, 896, 1024]
           256       [1280, 1536, 1792, 2048]
           512       [2560, 3072, 3584]
Large      4 KiB     [4 KiB, 8 KiB, 12 KiB, ..., 4072 KiB]
Huge       4 MiB     [4 MiB, 8 MiB, 12 MiB, ...]
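
To make Table 1 a little more concrete, here is a minimal sketch in C of how a small request could be rounded up to its size class, assuming the 16-byte quantum implied by LG_QUANTUM = 4 and exactly the classes listed in the table. The names small_size_class and SMALL_MAX are my own, chosen for illustration; this only reproduces the table and does not reflect how the allocator itself is implemented.

```c
#include <stddef.h>

#define LG_QUANTUM 4                          /* value from the aarch64 block above */
#define QUANTUM    ((size_t)1 << LG_QUANTUM)  /* 2^4 = 16 bytes */
#define SMALL_MAX  ((size_t)3584)             /* largest small class in Table 1 (assumed) */

/*
 * Hypothetical helper: returns the small size class from Table 1 that a
 * request of `size` bytes would be rounded up to, or 0 if the request is
 * too big to be a small object.
 */
static size_t small_size_class(size_t size)
{
    if (size > SMALL_MAX)
        return 0;                 /* large or huge, not handled in this sketch */
    if (size <= 8)
        return 8;                 /* the single tiny class */
    if (size <= 128)              /* quantum-spaced classes: 16, 32, ..., 128 */
        return (size + QUANTUM - 1) & ~(QUANTUM - 1);

    /* Past 128 the spacing doubles every group: 32 up to 256, 64 up to 512, ... */
    size_t spacing = 2 * QUANTUM;
    size_t group_max = 256;
    while (size > group_max) {
        spacing <<= 1;
        group_max <<= 1;
    }
    return (size + spacing - 1) & ~(spacing - 1);
}
```

With these assumptions, a 100-byte request lands in the 112-byte class and a 129-byte request in the 160-byte class, which matches the spacing column of the table.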

========================================================================

I have yet to hear back from the community about my further inquiry into the MinGW (Minimalist GNU for Windows) asm block in "utility.h", and whether there is a way, or any need, to port this operation to aarch64. The code block can be seen below:

#ifndef __MINGW32__
#undef assert
#define assert(x)           if (!(x)) { __asm { int 3 } }
#endif

From my own research, MinGW can be built as a cross-compiler for a given target processor by following these how-to instructions, which effectively tell you to download the required libraries and change the target variables to your particular processor/architecture. Either way, the ARM equivalent that I would have tested is a "BRK 0" instruction replacing the x86 "INT 3" debugging breakpoint, as mentioned in my earlier March recap post. Unfortunately, I cannot test any of this proposed logic due to the dependency issues in the build, discussed below.
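
For reference, the substitution I had in mind can be sketched roughly as follows. This is an untested sketch that assumes a GCC-style compiler with inline assembly support; DEBUG_BREAK and PORTABLE_ASSERT are hypothetical names of my own and do not appear anywhere in the Gnash source.

```c
#include <stdlib.h>

/* Pick a breakpoint instruction based on the target architecture. */
#if defined(__aarch64__)
#  define DEBUG_BREAK() __asm__ volatile("brk #0")  /* AArch64 software breakpoint */
#elif defined(__i386__) || defined(__x86_64__)
#  define DEBUG_BREAK() __asm__ volatile("int3")    /* x86 debugging breakpoint */
#else
#  define DEBUG_BREAK() abort()                     /* assumed fallback elsewhere */
#endif

/*
 * Hypothetical replacement for the assert macro in utility.h: the
 * breakpoint is only reached when the condition is false.
 */
#define PORTABLE_ASSERT(x) \
    do { if (!(x)) { DEBUG_BREAK(); } } while (0)
```

Wrapping the body in do { ... } while (0) keeps the macro safe inside an unbraced if/else, which the original one-line form is not.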

My biggest trouble has been simply running the configure script and make on the QEMU environment. The purported development version of Gnash is currently 0.8.11, yet in every channel I've tried to download the source code from, including the repository clone from GitHub, the repository from fedpkg, and a few Canadian FTP mirrors for GNU-related packages, everything stops at 0.8.10, even though 0.8.11's changelog is already posted on Gnash's wiki page.

Regardless, after running the configure script with the build set to aarch64 in the package's present state ("./configure --build=aarch64-unknown-linux"), the resulting output shows one crucial missing dependency that I cannot install on the QEMU environment: xulrunner-devel. This is a Mozilla-related package that, according to this Bugzilla discussion, had yet to be upstreamed in an aarch64-compatible version as of the 10th of April. This is corroborated by the lack of any matches for the package in the relevant yum repositories, whether trying yum install or running a yum search all query.

In conclusion, given the scheduling and time constraints of the 14-week semester and the workload of other classes, I was not able to make the progress I had hoped for on a package with as many obstacles as Gnash. My hope is that my research and work herein will serve as a potentially useful tool for the rest of the community going forward in the effort to successfully build, run, and optimize this package for 64-bit ARM implementations. I will upload a public GitHub repository of my environment in its current state and post a link to it on this blog at the community's behest.

Saturday 12 April 2014

The Last Ditch Effort, Pt. 1

After presenting the current state of my work on the Gnash package, I am back to solving the build dependency debacle before inserting the assembly translations into the package code. "yum-builddep gnash.spec" immediately proved unsuccessful, so after a few attempts at troubleshooting via Google I went ahead and used the piped sed command combination generously given to me by Chris Tyler:

yum install -y $(cat *spec | sed -n "s/^BuildRequires://p" | sed "s/>=.*//")

Effectively, this stream editing command parses the gnash.spec file in two passes. The first sed prints only the lines that start with the "BuildRequires:" string, with that prefix stripped off, and the second sed removes any version constraint from ">=" to the end of the line. The surrounding $(...) then substitutes the resulting package names into the yum install command as its arguments.

As for my research into the assembly code I found, the affected files, "jemalloc.c" and "utility.h", wrap their pre-processor blocks in the conditions "#ifdef __i386__" and "#ifdef __amd64__", and "#ifndef __MINGW32__", respectively. Unfortunately, with Google being of very little help in such a niche specialization, I am currently awaiting word from the community on whether any of this logic needs to be reapplied for aarch64 - especially the memory allocation logic of the blocks in jemalloc.c, for example:

#ifdef __arm__
#  define QUANTUM_2POW_MIN     4
#  define SIZEOF_PTR_2POW         3
#  define NO_TLS
#endif

More to come in the next few days.

Tuesday 1 April 2014

Comprehensive March Recap

As expected, the latter part of the current semester has been nothing short of chaotic, with project work being pushed aside in favour of test preparation, so an apology is due on behalf of yours truly for not consistently updating on the progress of the software package port. The progress below is subdivided chronologically (to the best of my ability) into relevant sections in order to best reference the steps, obstacles, and solutions that have been encountered over the past month:

Acquisition of the package - 

Shortly after the codebase analysis lab (found here), which covered Unix commands, file extensions, and directories that are useful for importing software packages and searching them for assembly code, a list of software packages and libraries taken from Linaro's Linux on ARM64 porting project was posted on the course wiki to give students a reasonable amount of time to decide on a package that satisfies the prescribed requirements (http://zenit.senecac.on.ca/wiki/index.php/Winter_2014_SPO600_Software_List). After careful consideration, which included downloading the package and analyzing the scope of the work needed to either port or eliminate any assembly-related dependencies for the new architecture within the time constraints of the semester, my sights were set on GNU's open source Flash decoder package - Gnash.

About Gnash - 

The GNU Flash player. An open source, web-based video and audio decoder dealing mostly with .swf files, up to version 9. The software is based on a library called gameswf. Any other details regarding the software's overview are found on its homepage. The development community is rather scarce at this point, with one individual on their IRC channel pointing me to the mailing list as being the last bastion of community discussion in relation to bugs or patches. Unfortunately, after having subscribed to it for at least a month, I have seen no activity whatsoever.

Local Environment Setup - 

Given the options provided by Chris Tyler for analysis, implementation, and benchmarking, as well as the constraint of running Windows 8 natively on my local machine before embarking on this course, my initial task was to install Fedora 20 on a virtual machine. I chose Oracle's VirtualBox software, and a quick Google search led me to a very useful wikiHow article on an efficient way to install Fedora on Windows using the Oracle VM.

Installing the 64-bit ARM emulator available to us on the Ireland remote server on the local machine seemed like the next ideal task to accomplish and was achieved with relative ease using the SPO600 Wiki's instructions at the bottom of the QEMU overview page. After a slight misunderstanding of where to properly transfer the "qemu-arm64.conf" file on the local machine (the /etc/binfmt.d directory, and not the newly created arm64 directory used for the emulator) and some fumbling around with the tar command options for unpacking the compressed version of the environment, it was mostly straightforward.

Lastly, downloading and installing the Fedora AArch64 Foundation Model was a slightly odd experience, with the biggest stumble occurring when trying to download the files from the arm64 website and not seeing the actual "download" link appear at the bottom of the page. This was promptly solved by switching back to the Windows machine to download the needed file and e-mailing it to myself on the virtual machine.

The x86_64 Build and First Bugfix -

After peering and perusing around and finding the assembly code needed to begin the research and work of translating and compiling the package on QEMU, the first attempt at building Gnash on x86_64 failed after about 30 minutes. The culprit was determined to be a nullptr-related error in two places in the npruntime.h file. Simply opening the file and editing the code manually proved sufficient for the build to succeed on the second run.

Ongoing aarch64 Porting Implementation -

Running a recursive grep for patterns containing "asm", "__asm", or "(asm" produced the following results:

gnash-0.8.10/libbase/jemalloc.c:#  define CPU_SPINWAIT __asm__ volatile("pause")
gnash-0.8.10/libbase/jemalloc.c:#  define CPU_SPINWAIT __asm__ volatile("pause")

gnash-0.8.10/libbase/utility.h:#define assert(x) if (!(x)) { __asm { int 3 } }

With Professor Tyler's advice, as well as some research into the ARM ISA, I have managed to narrow down the aarch64 alternatives: a BRK 0 instruction as a replacement for the "int 3" debugging breakpoint in the assert macro, and a "YIELD" instruction as a substitute for the "pause" instruction in spin/wait loops, to be tried on the QEMU environment.
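
As a rough illustration of the spin/wait half of that plan, here is a minimal sketch, assuming a GCC-style compiler and C11 atomics. Only the CPU_SPINWAIT name and the "pause" instruction come from jemalloc.c; the aarch64 branch is my proposed substitution, and spin_lock and spin_unlock are hypothetical helpers that exist only to show where the macro sits inside a loop.

```c
#include <stdatomic.h>

/* Architecture-specific hint executed while busy-waiting. */
#if defined(__i386__) || defined(__amd64__) || defined(__x86_64__)
#  define CPU_SPINWAIT __asm__ volatile("pause")
#elif defined(__aarch64__)
#  define CPU_SPINWAIT __asm__ volatile("yield")  /* proposed aarch64 substitute */
#else
#  define CPU_SPINWAIT ((void)0)                  /* assumed no-op elsewhere */
#endif

/* Hypothetical spin lock: retry until the flag is acquired. */
static void spin_lock(atomic_flag *lock)
{
    while (atomic_flag_test_and_set_explicit(lock, memory_order_acquire)) {
        CPU_SPINWAIT;  /* tell the core we are only waiting */
    }
}

/* Release the flag so another thread can acquire it. */
static void spin_unlock(atomic_flag *lock)
{
    atomic_flag_clear_explicit(lock, memory_order_release);
}
```

Both instructions are hints rather than synchronization primitives, so the loop stays correct even if the hint does nothing, which is why an empty fallback is acceptable in this sketch.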

I am using a few references to look further into these operations and their purpose:

Stack Overflow Query
INT (x86_instruction) - Wikipedia
ARMv8 Instruction Set Overview and Developer Manual

This is currently in the process of being implemented and tested.

Miscellaneous Calamities -

A mistake of seemingly seismic proportions occurred on my part on March 27th. In the middle of a lecture, a notice to update the operating system popped up at the bottom of my screen. An update effectively downloads and installs the new packages first before deleting the old ones. Whether or not what followed is a symptom of default settings on the virtual machine needs to be looked into further, but the update was likely interrupted by the virtual machine powering off partway through, after new versions of some packages had been installed but before their older versions had been removed. The result was duplicate and obsolete package versions, most notably around a GUI-related package called "gdm", which left the operating system unable to load any user interface layer other than the terminal. The 4 days that followed consisted of complete and utter mental stress, chaos, and anguish, only to be solved by (yet again) the technical expertise and prowess of Chris Tyler. The lesson here is never to update your virtual machine's operating system unless you are confident that you know and control the processes that will be executing, as well as their timing.