Wednesday 16 April 2014

The Last Ditch Effort, Pt. 2

To my great dismay, I am running into still more roadblocks to a successful build and benchmark of the aarch64 port of Gnash as the semester comes to a close:

Word has reached me from the community regarding some of the inline assembly in "jemalloc.c". Apparently, as my own research has also suggested, an updated version of the memory allocation algorithm, one that already includes aarch64 implementation logic, has been integrated into Gnash. Their github repository can be found and cloned using this link. The "pause" instruction for the cpu_spinwait spin-loop logic remains the same, but its definition has been moved into the configure.ac file, while a new memory allocation value for aarch64 was added as a pre-processor directive block in "jemalloc_internal.h.in", located in the include/jemalloc/internal directory:

# ifdef __aarch64__
#   define LG_QUANTUM       4
# endif
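As for the spin-wait half of that change, I have not yet pulled apart the new configure.ac logic, but if the hint were written directly as inline assembly, the aarch64 counterpart of x86's "pause" would presumably be the "yield" hint instruction. The block below is only my own sketch of that idea in GCC-style inline assembly, not code taken from the jemalloc repository:

/* Sketch only: how an aarch64 spin-wait hint (the counterpart of x86's
 * "pause") could be written as GCC-style inline assembly. jemalloc's real
 * CPU_SPINWAIT value is now chosen in configure.ac, which I have not yet
 * verified for aarch64. */
#if defined(__aarch64__)
#  define CPU_SPINWAIT __asm__ volatile("yield")
#elif defined(__i386__) || defined(__x86_64__)
#  define CPU_SPINWAIT __asm__ volatile("pause")
#else
#  define CPU_SPINWAIT
#endif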

From reading the API's documentation, LG_QUANTUM declares the base-2 logarithm of the quantum, the minimum alignment (and the spacing of the small size classes) that the allocator guarantees on a given architecture; with the value 4 above, that comes to 2^4 = 16 bytes on aarch64. It is also used in conjunction with other tools, such as Valgrind, for memory-leak detection, allocation debugging, and profiling. The aforementioned block of code is one of many such blocks for the various supported architectures.
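To make the effect of that value concrete, here is a small, self-contained sketch (plain C of my own, not jemalloc code) of how a 16-byte quantum rounds small allocation requests up to quantum-spaced size classes:

#include <stdio.h>

/* Hypothetical illustration: with LG_QUANTUM = 4 the quantum is 1 << 4 = 16
 * bytes, and small requests larger than half the quantum are rounded up to a
 * multiple of it.  (Per the docs, requests no larger than half the quantum
 * are instead rounded to a power of two of at least sizeof(double).) */
#define LG_QUANTUM  4
#define QUANTUM     ((size_t)1 << LG_QUANTUM)

/* Round size up to the next multiple of QUANTUM (a power of two). */
static size_t quantum_ceil(size_t size)
{
    return (size + QUANTUM - 1) & ~(QUANTUM - 1);
}

int main(void)
{
    size_t requests[] = { 17, 33, 100, 128 };
    for (size_t i = 0; i < sizeof(requests) / sizeof(requests[0]); i++)
        printf("request %3zu bytes -> size class %3zu bytes\n",
               requests[i], quantum_ceil(requests[i]));
    return 0;   /* prints 32, 48, 112, 128 */
}

With that in mind, the paragraphs quoted below from the implementation notes of the API are the ones that seem most relevant to the logic behind this code block: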

"Traditionally, allocators have used sbrk(2) to obtain memory, which is suboptimal for several reasons, including race conditions, increased fragmentation, and artificial limitations on maximum usable memory. If --enable-dss is specified during configuration, this allocator uses both mmap(2)and sbrk(2), in that order of preference; otherwise only mmap(2) is used.
This allocator uses multiple arenas in order to reduce lock contention for threaded programs on multi-processor systems. This works well with regard to threading scalability, but incurs some costs. There is a small fixed per-arena overhead, and additionally, arenas manage memory completely independently of each other, which means a small fixed increase in overall memory fragmentation. These overheads are not generally an issue, given the number of arenas normally used. Note that using substantially more arenas than the default is not likely to improve performance, mainly due to reduced cache performance. However, it may make sense to reduce the number of arenas if an application does not make much use of the allocation functions.
In addition to multiple arenas, unless --disable-tcache is specified during configuration, this allocator supports thread-specific caching for small and large objects, in order to make it possible to completely avoid synchronization for most allocation requests. Such caching allows very fast allocation in the common case, but it increases memory usage and fragmentation, since a bounded number of objects can remain allocated in each thread cache.
Memory is conceptually broken into equal-sized chunks, where the chunk size is a power of two that is greater than the page size. Chunks are always aligned to multiples of the chunk size. This alignment makes it possible to find metadata for user objects very quickly.
User objects are broken into three categories according to size: small, large, and huge. Small objects are smaller than one page. Large objects are smaller than the chunk size. Huge objects are a multiple of the chunk size. Small and large objects are managed by arenas; huge objects are managed separately in a single data structure that is shared by all threads. Huge objects are used by applications infrequently enough that this single data structure is not a scalability issue.
Each chunk that is managed by an arena tracks its contents as runs of contiguous pages (unused, backing a set of small objects, or backing one large object). The combination of chunk alignment and chunk page maps makes it possible to determine all metadata regarding small and large allocations in constant time.
Small objects are managed in groups by page runs. Each run maintains a frontier and free list to track which regions are in use. Allocation requests that are no more than half the quantum (8 or 16, depending on architecture) are rounded up to the nearest power of two that is at least sizeof(double). All other small object size classes are multiples of the quantum, spaced such that internal fragmentation is limited to approximately 25% for all but the smallest size classes. Allocation requests that are larger than the maximum small size class, but small enough to fit in an arena-managed chunk (see the "opt.lg_chunk" option), are rounded up to the nearest run size. Allocation requests that are too large to fit in an arena-managed chunk are rounded up to the nearest multiple of the chunk size.
Allocations are packed tightly together, which can be an issue for multi-threaded applications. If you need to assure that allocations do not suffer from cacheline sharing, round your allocation requests up to the nearest multiple of the cacheline size, or specify cacheline alignment when allocating.
Assuming 4 MiB chunks, 4 KiB pages, and a 16-byte quantum on a 64-bit system, the size classes in each category are as shown in Table 1."

Table 1. Size classes
Category    Spacing    Size
Small       lg         [8]
            16         [16, 32, 48, ..., 128]
            32         [160, 192, 224, 256]
            64         [320, 384, 448, 512]
            128        [640, 768, 896, 1024]
            256        [1280, 1536, 1792, 2048]
            512        [2560, 3072, 3584]
Large       4 KiB      [4 KiB, 8 KiB, 12 KiB, ..., 4072 KiB]
Huge        4 MiB      [4 MiB, 8 MiB, 12 MiB, ...]
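One practical point from the quoted notes worth illustrating is the advice about cacheline sharing. A minimal sketch of requesting cacheline-aligned memory through the standard posix_memalign interface, assuming a 64-byte cacheline (typical on both x86-64 and current aarch64 parts), might look like this; the helper name is my own:

#define _POSIX_C_SOURCE 200112L  /* for posix_memalign */
#include <stdlib.h>

/* Assumed cacheline size: 64 bytes is common, but portable code should
 * query the real value at runtime rather than hard-coding it. */
#define CACHELINE 64

/* Return 'size' bytes aligned to (and padded to a multiple of) the cacheline
 * size, so the allocation does not share a cacheline with its neighbours. */
static void *cacheline_alloc(size_t size)
{
    void *p = NULL;
    size_t padded = (size + CACHELINE - 1) & ~(size_t)(CACHELINE - 1);
    if (posix_memalign(&p, CACHELINE, padded) != 0)
        return NULL;
    return p;
}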

========================================================================

I have yet to hear back from the community on my further inquiry about the Minimalist GNU for Windows (MinGW) asm block in "utility.h", and whether there is a way, or even a need, to port this operation to aarch64. The code block can be seen below:

/* When not building with MinGW, redefine assert() to trap into the debugger
   with an x86 "int 3" software breakpoint via MSVC-style inline assembly. */
#ifndef __MINGW32__
#undef assert
#define assert(x)           if (!(x)) { __asm { int 3 } }
#endif

From my own research, MinGW can be cross-built for whatever processor is required via these how-to instructions, which essentially amount to downloading the required libraries and setting the target variables for your particular processor/architecture. Either way, the ARM equivalent I would have tested is a "BRK #0" instruction in place of the x86 "INT 3" debugging breakpoint, as mentioned in my earlier March summary recap post. Unfortunately, I cannot test any of this proposed logic because of the dependency issues in the build, discussed below.
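For reference, here is a sketch of what that untested replacement might look like in GCC-style inline assembly for aarch64, mirroring the structure of the block above:

/* Sketch: an aarch64 counterpart to the MSVC-style __asm { int 3 } breakpoint.
 * BRK #0 raises a breakpoint exception, the usual way to trap into a debugger
 * on AArch64. Untested against an actual Gnash build. */
#ifdef __aarch64__
#  undef assert
#  define assert(x)         if (!(x)) { __asm__ volatile("brk #0"); }
#endif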

My biggest trouble has been simply running configure and make in the qemu environment. The purported development version of Gnash is currently 0.8.11, yet every channel I have tried for the source code, including the repository clone from github, the repository from fedpkg, and a few Canadian FTP mirrors for GNU-related packages, stops at 0.8.10, even though the 0.8.11 changelog is already posted on Gnash's wiki page.

Regardless, after running ./configure on the package in its present state with the build target changed to aarch64 ("./configure --build=aarch64-unknown-linux"), the resulting output shows one crucial missing dependency that I cannot install in the qemu environment: xulrunner-devel. This is a Mozilla-related package that, according to this Bugzilla discussion, had still not been upstreamed in an aarch64-compatible version as of the 10th of April. This is further confirmed by the lack of any matches for the package in the relevant yum repositories when running yum install or a yum search all query.

In conclusion, between the scheduling and time constraints of the 14-week semester and the workload of my other classes, I was not able to make as much progress as I had hoped on a package with as many obstacles as Gnash. My hope is that the research and work documented here will be useful to the rest of the community in the ongoing effort to build, run, and optimize this package for 64-bit ARM implementations. At the community's request, I will upload a public github repository of my environment in its current state and post a link to it on this blog.
