Monday, December 22, 2014

Calypso to Mars: First contact

Writing bindings for a C/C++ library has already consumed an awful lot of D users' time, and it's not uncommon for a third of the front-page announcements on the official D forums to be about work-in-progress bindings, yet countless widespread libraries are still waiting for bindings to be written*.

If you want a taste of what it currently takes to get a C++ library to communicate with D, have a look at the incomplete QtD or wxd. Then consider the additional time you'd need if your project happened to depend on one or more C++ libraries for which neither a D equivalent nor bindings exist yet.

If the process were made shorter or even entirely automatic, one of the major obstacles to the adoption of D would disappear. D is an amazing language, were it only for the metaprogramming features it adds to C++, and it deserves a great future. But a strongly rooted language like C++ isn't easy to dislodge, so let's see if we can loosen its grip...

[*] There are no up-to-date bindings for most of the libraries I intended to use for my amateur game project: LLVM (no bindings beyond LLVM 3.0), Ogre3D (OpenMW had D bindings in a distant past before switching to C++, none since then), Bullet (someone is working on it, but it exposes only a small subset of the Bullet API at the moment), Recast (none), cAudio (none), ...

Automatic generation of mirror declarations?

The first approach I considered was to generate D declarations mirroring the C++ ones, as htod already does, but encompassing all C++ features and without having to create a C++ → C → D layer in the fashion of SWIG (Simplified Wrapper and Interface Generator). It should be noted that the overhead of a wrapper can be mitigated by link-time optimization: Clang, for example, has a fourth optimization level, -O4, which produces LLVM bitcode files instead of standard .o files, and when linking with an LLVM-aware gold it inlines function calls across object files and can thus fully inline the wrapper. But SWIG for D still requires an interface file listing classes, namespaces, explicit template instantiations, etc. to be maintained or generated.

So that first approach would simply take the current paradigm one step further, which makes it naturally tempting, but the problems are numerous and essentially come down to one thing: D's official C++ support is limited.

How would we represent members of a class that are themselves class instances, since they can't be values in D? With something dreadful like char[__traits(classInstanceSize, CppClass)] accessed through functions? By declaring the C++ classes as structs? In any case they couldn't inherit from one another, there would be no __traits support, value semantics instead of reference semantics, virtual functions would have to be emulated, ...

The single-inheritance C++ class model was added to DMD, but what about multiple inheritance (for which interfaces such as A : Base, I1, I2 aren't a viable answer, since they only map to a subset of C++'s capabilities)? Add it to the language specifically for C++ as well? Add a quirky built-in template to the compiler to avoid changing the language?
How would templates be instantiated? By adding more quirks to the compiler, or by extending CTFE to write to files or run external programs? And what about virtual functions, which are notoriously complex thanks to multiple inheritance?

Basically we cannot escape the use of a wrapper; otherwise, going down that mirroring road would require many changes to DMD, bringing it ever closer to a C++ compiler.
We've been warned: it's even in the official documentation, where someone (Andrei?) wrote this soul-crushing anecdote:

« Being 100% compatible with C++ means more or less adding a fully functional C++ compiler front end to D. Anecdotal evidence suggests that writing such is a minimum of a 10 man-year project, essentially making a D compiler with such capability unimplementable. »

Okay, rewriting yet another C++ compiler inside DMD would surely be a lot of work, but why not reuse an existing C++ compiler and parts of its source code? Being already familiar with LLVM and Clang's source code from my past experience making a Lua-like language able to call C++ functions directly, it struck me that there was no reason to believe this would be impossible or unbearably difficult.
This became the second approach, and it has now proven to work.

Second approach

It is possible to make DMD completely C++-aware, cleanly and without reinventing the wheel: take parts of an existing C++ compiler and graft them (maybe in the form of a plugin) onto the compiler, in order to handle something like:
import (C++) "header.hpp";   (this shouldn't be an import statement though, see next section)
The graft would expose symbols to the D compiler and map as many D features as possible to the C++ compiler parts. It wouldn't require many changes in the DMD front-end other than placing some hooks here and there and making some methods overridable by the Clang glue.

Here's a quick sketch of how such a Clang-based glue could work:
  1. Call clang with -emit-ast on a file containing a list of #include directives (the resulting file can be cached as long as no new headers need to be parsed into the AST; well-written C++ headers are order-independent)
  2. Load the AST with clang::ASTReader and generate the virtual tree of packages/modules (kept separate from the D ones to avoid name clashes between Phobos and the C++ STL)
  3. When those virtual modules are imported, expose Dsymbols mapped onto Clang's AST
This way semantic analysis is handled by the clang executable, and DMD doesn't have to worry about most C++ subtleties.
Then, in the middle-end of LDC, reuse some parts of Clang's CodeGen to handle name mangling, function and method calls, class construction, building GEP instructions, etc.

With this approach adding most C++ features comes at little cost, any memory layout subtlety is handled by Clang, etc.

The tricky naming question

One issue that comes with the automatic generation of modules from C++ headers is how to handle namespaces versus filenames: which names should be chosen for packages and modules?

The import statement has to know where the headers are located, but can we use the path for the package and module names, as is done for D files, and ignore the namespaces? There would be name conflicts. Keep the namespaces? We would end up with fairly long class and function names, be forced to enclose entire blocks of code within with() statements to avoid them, and anyway fully qualifying imported symbols isn't natural in D.

But the thing is: we can't use the filename at all in import statements. C++ standard headers are usually located inside /include or a C++-specific folder, e.g. #include <vector>, which would translate to import vector; and isn't very D-ey. Furthermore, and more importantly, the implementation of the STL is rarely inside the exposed headers: the std::vector class is actually declared in bits/stl_vector.h.
Hence we would end up with something like import bits.stl_vector; which isn't D-ey at all, and non-portable.

If a C++ library were rewritten in D, we would map the packages to the C++ namespaces in a more natural way and choose a meaningful name for each module. Since here the name has to be chosen without a human touch, I opted to put every class in a separate module named after it. The entire package/namespace can be imported at once, so this shouldn't be too inconvenient.

But then we need to specify the headers to parse before the imports.

modmap (C++)

Introducing a new keyword is potentially code-breaking, but considering that it might become as essential as extern (C) is, a keyword aesthetically beats:
pragma (modmap, "C++", "header.hpp");
Instead:
modmap (C++) "header1.hpp";
modmap (C++) "header2.hpp";
are used to specify the « #include directives » required by the module; these are then passed to Clang to generate the corresponding AST. Headers of well-written C++ libraries are order-independent, which makes their « modularisation » possible.

Good for C++, but..?

There remains the question of C headers, which can't be split along namespaces. Modules for C/C++ are expected to arrive at some point in C++ itself, but Clang has already taken action and implemented its own promising module support for C at the end of 2012.

Their system deals with the problems inherent to the modularisation of C/C++ headers: a module map, i.e. a file mapping headers to modules, is handwritten or generated with the help of the modularize tool. Then, with -fmodules enabled, #include becomes an import directive much like the D or C# one, and the compiler precompiles headers into serialized AST files. Once they are imported for the first time there's no need to reparse the headers, and unlike precompiled headers this remains true as long as the global defines and the headers don't change. Unfortunately, although the documentation page shows a snippet of what could be a module map for the C standard library, it isn't in Clang's sources. That's a pity, because an early unofficial standard for splitting common C libraries into modules would have been neat.
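For reference, a module map in the format described by Clang's modules documentation looks roughly like the following hypothetical fragment (library and header names invented for illustration):

```
module mylib {
  header "mylib.h"
  export *

  explicit module extras {
    header "mylib_extras.h"
    export *
  }
}
```

Each module declaration names the headers it owns and what it re-exports; submodules can partition a library, which is precisely the kind of fragmentation of the global namespace discussed below.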

However, perhaps we don't have to wait for Clang: we could reuse its module map file format (for C libraries only) and submit a module map for both D and Clang.

You'll notice that the question previously raised was essentially how to generate a module map. The module system of Clang isn't just a solution for C; we could actually use it for C++ as well, if it supported C++1 and weren't lacking one feature Calypso's companion of choice must possess: currently a module can only map to one or more headers, that is, a single header cannot be split across multiple modules.

1 According to clang.llvm.org, module support for C++ is « very experimental and broken » at the moment, but no reason is given as to why; I couldn't find any further information.

As a matter of fact the current implementation of Calypso uses a precompiled header rather than Clang modules, and it will stay that way until Clang fixes C++ modules. But I will use the module map file format to fragment the global namespace as soon as possible.

So although imports work as intended, their implementation is more of a hack at the moment, since the module name is used to look up the record declaration (i.e. there's no « true physical module », just a big PCH lazily loaded, in which we simply look up package::package::module). But while working with vanilla Clang and precompiled headers, this is the easiest and most efficient way to implement them.

In short, Clang modules are the most elegant solution, especially for C, but they're not there yet.

Fast forward to Calypso

I'm tracking down the remaining bugs and missing code blocking the path to sophisticated C++ libraries as fast as possible, and although Calypso is still heavily WIP, its C++ support is already quite broad and it is close to being ready to use:

  • Global variables
  • Functions
  • Structs
  • Unions (symbol only)
  • Enums
  • Typedefs
  • C++ class creation with the correct calls to ctors (destruction is disabled for now)
  • Virtual function calls
  • Static casts between C++ base and derived classes (incl. multiple inheritance offsets)
  • Mapping implicit and explicit template specializations already in the PCH to DMD ones (no new specializations on the D side yet)
  • D classes inheriting from C++ ones, including the correct vtable generation for the C++ part of the class 

Calypso already works in numerous test cases (see tests/calypso), although as long as Clang module maps aren't implemented, importing declarations from the global namespace, i.e. from any C library, is going to be a huge mess.

Try it out:
https://github.com/Syniurge/Calypso

Follow the development on the D forums:
http://forum.dlang.org
