| Organisation: | Copyright (C) 2008-2024 EDF R&D |
|---|---|
| Contact: | olivier (dot) boudeville (at) edf (dot) fr |
| Author: | Olivier Boudeville |
| Creation Date: | February 2009 |
| Lastly updated: | Tuesday, February 13, 2024 |
| Version: | 2.4.7 |
| Status: | Stable |
| Website: | http://sim-diasca.com |
| Dedication: | To the Sim-Diasca developers |
| Abstract: | This document summarizes the main conventions that should be respected when contributing code to Sim-Diasca and/or making use of Sim-Diasca. |
Note
This document intends to gather information mostly aimed at Sim-Diasca maintainers or contributors - not users. Most people should read the Sim-Diasca Technical Manual first, and possibly the Sim-Diasca Modeller Guide as well.
However we require that the in-house authors of any simulation element making use of Sim-Diasca's services (e.g. models, simulation cases, etc.) respect the conventions presented in the current document, for the sake of the clarity and homogeneity of the code base.
As we believe that these conventions may benefit third-party users, we share them as well.
At all levels of the technical architecture, we tried to enforce a few conventions, which are detailed below.
Some of them are necessary, others are mere good practices, a few are arbitrary (notably in terms of style), but we believe that, for the sake of clarity and homogeneity, all of them should be respected in the code of the Sim-Diasca stack, and preferably also in code using it (typically models, tools, etc.).
(see the credits section about the comic strips)
We also believe that these conventions have been fairly well enforced in our own Sim-Diasca code base (which thus might be used as an example thereof). Please tell us if you do not think so, or if you have identified other interesting conventions that could be listed here and applied.
We now recommend sticking directly to the Myriad ones [mirror].
Just ensure you typed everything properly:
We see Sim-Diasca as a stack of layers, so that a given layer only depends on the ones below it, and never on the ones above.
Top-to-bottom, we have:
| Layer Name | Role |
|---|---|
| Sim-Diasca | This simulation engine |
| Ceylan-Traces | The distributed trace system |
| Ceylan-WOOPER | The object-oriented layer |
| Ceylan-Myriad | The base library offering general-purpose services |
| Erlang | The base language and environment |
Thus there is no upward dependency: for example, WOOPER depends on Myriad and Erlang, but not on Traces or Sim-Diasca.
The other way round, bottom-up one can see:
On top of that stack, which provides the simulation engine, there are at least:
Since the simulation engine is itself absolutely generic (as long as discrete-time simulations are involved), it may be convenient to define, on top of Sim-Diasca and below the actual models themselves, a domain-specific layer that specialises the framework in order to ease the development of models.
For example, a telecom-centric simulation could define building blocks like service queues, and mother classes like communicating nodes, network interfaces, packet loss models, etc.
The most obvious conventions are:
[1] | In terms of directories, the source of modules (*.erl) shall be in src, the includes (*.hrl) in include, the tests (*_test.erl) in test - in each case, either directly in the specified directory, or at any depth in nested subdirectories. |
Note
Mute variables are however actually bound; thus if, for example, _Min=3 and later _Min=4 appear in the same scope, a badmatch will be triggered at runtime. Therefore the names of mute variables should generally be kept unique in any given scope.
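For instance, this minimal, purely illustrative function would compile, yet crash at runtime:

% Both variables are mute, yet the first match binds _Min to 3,
% so the second match fails with a {badmatch,4} error:
f() ->
	_Min = 3,
	_Min = 4,
	ok.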
For the sake of clarity, we try to avoid overly compact code, and code so obscure that only its original creator can understand it. Thus we want to enforce a minimum ratio of blank lines and comments.
For example, as of May 2017, we have for the Sim-Diasca stack (i.e. from myriad to sim-diasca):
This information can be obtained by running make stats from the root of a Sim-Diasca install.
Other recommended good practices are:
Another piece of advice, which we should perhaps apply more frequently:
Of course we cannot stress enough that securing a sufficient code quality is essential for the other developers to come, and often even for one's future self; for that reason we recommend pair programming, or at least the aforementioned review, before new sources are incorporated into the code base:
The general goal is to collect recurring generic lower-level patterns and solutions in that layer.
When a helper mechanism is already available in Myriad, it should be used, instead of being defined multiple times in the software stack.
Reciprocally, when a well-defined generic sequence of instructions is used more than once, it should be integrated (commented and tested) in that Myriad layer.
Main services are:
For further information, please refer to the Technical Manual of the Myriad Layer [2].
[2] | It can be generated by running make full-doc from myriad/doc. |
One should respect the WOOPER conventions (please refer to the WOOPER documentation).
For example, the nature of each method (oneway, request, or helper function, possibly with qualifiers like const) should be specified in its header.
WOOPER type conventions shall be used as well, for example:
-spec getFoo( wooper:state() ) -> request_return( foo() ).
Method names should be spelled in CamelCase (e.g. getColorOf, not get_color_of).
Variables bound to a WOOPER state shall have their name suffixed with State; e.g. NewState, UpdatedState, etc.
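As an illustration, here is a minimal, purely hypothetical oneway following these conventions; setAttribute/3 is provided by WOOPER, and the exact return helper may depend on the WOOPER version in use:

% Sets the age of this instance (oneway):
setAge( State, NewAge ) ->
	UpdatedState = setAttribute( State, age, NewAge ),
	wooper:return_state( UpdatedState ).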
To better discriminate between methods and functions (e.g. helpers):
Example:
% The class-specific attributes of an instance of this class are:
-define( class_attributes, [

	{ current_step, step_count(),
	  "current step at which the experiment is" },

	[...]

] ).
So that log messages can be kept track of over time, a distributed trace system is provided, with relevant components: trace emitter, listener, supervisor, aggregator, etc.
At implementation time, one just has to choose:
Once building the simulator, one can configure:
At execution time, the command-line option --batch can be specified, which causes all interactive elements to be disabled, including any trace supervisor (like the LogMX browser). It can be specified that way:
$ make my_case_run CMD_LINE_OPT="--batch"
For convenience, the developer often defines, in their shell environment file, export BATCH='CMD_LINE_OPT="--batch"', so that this shorthand can be used instead:
$ make my_case_run $BATCH
(this is very convenient when debugging brand new code: before having to peer at the traces, runtime errors may occur, and the relevant information, like the stack trace, the actual parameters and the current instance state will be displayed on the console)
The trace system has been designed with performance and scalability in mind: if disabled, no per-instance overhead will be incurred at all; if enabled, efforts have been made so that as large a number of traces as possible can be managed by the trace system, for any given resources in terms of network, memory and processing.
Should numerous traces be sent, it could be interesting to create more than one instance of the trace aggregator class, for example:
An overloaded trace aggregator will notify the user, so that the verbosity of later runs can be decreased.
In this section the conventions to be respected with regard to Sim-Diasca are detailed.
The Sim-Diasca services should be used whenever applicable. Notably, bypassing the simulation mechanisms (for actor creation, communication, deletion, etc.) is absolutely prohibited, as essential properties, like the respect of causality, would then be lost.
Such a short-term "simplification" would be considered even more harmful than others, like the use of the goto statement in C:
To implement the communication between two actors, neither plain Erlang messages nor arbitrary WOOPER messages can be used: as we aim to develop a distributed simulation, as opposed to a mere distributed application, we have to make use of the Sim-Diasca mechanisms for automatic inter-actor message reordering.
This means that, for inter-actor communication purpose, only the following helper functions shall be used:
Of course one may rely on higher-level specialization layers, making use of Sim-Diasca but providing another API. The point is that inter-actor messages should be ultimately managed by Sim-Diasca, rather than being sent directly with the ! operator provided by Erlang.
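For example, an actor willing to have the setColor/3 actor oneway of another actor triggered at the next diasca would use a call like the following sketch (actor and oneway names here are purely illustrative):

% Returns an updated state; the message will be reordered and
% processed by the target actor at the next diasca:
SentState = class_Actor:send_actor_message( TargetActorPid,
	{ setColor, [ red ] }, State ),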
Very similarly, life-cycle of actors should be ultimately managed by the engine, not directly by the developer: the creation and termination of actors must respect the conventions detailed below.
An actor can be created either initially (before the simulation is started) or at simulation-time (i.e. in the course of the simulation), as detailed in next sections.
Calling directly (i.e. from the user code) any spawn variation or any WOOPER new variation (e.g. remote_new_link) is totally prohibited: we must rely on the Sim-Diasca mechanisms, and not attempt to bypass them.
The actual creation will be performed by the load balancer, on a computing node of its choice, and the placement will be fully transparent for the model writer.
Whether an actor is initial or not, construction parameters are to be supplied for its creation.
These construction parameters will be specified as a list, for example [_Age=5,_Gender=male]. The matching constructor must then be in the form of:
construct( State, ActorSettings, P1, P2 ) ->
Note that two additional initial parameters appear here:
So a typical constructor for a model class_M1, inheriting from, for example, class_SpecialisedActor and from classes class_C1 and class_C2, could be:
% Constructs a new instance of class_M1:
construct( State, ActorSettings, P1, P2, ... ) ->

	% Will result ultimately in a call to
	% class_Actor:construct( State, ActorSettings, AName ):
	SpecialisedState = class_SpecialisedActor:construct( State,
		ActorSettings, ... ),

	C1State = class_C1:construct( SpecialisedState, P1, ... ),
	C2State = class_C2:construct( C1State, P1, P2, ... ),

	[...]

	FinalState.
Note how states are chained (one being built on top of the other), from the blank, initial one (State) to the one corresponding to the complete initial state of the instance (FinalState), as returned by the constructor.
We strongly encourage the use of type specifications, which would be here:
-spec construct( wooper:state(), class_Actor:actor_settings(),
				 type_of_p1(), type_of_p2() ) -> wooper:state().
An abstraction API is available to create initial actors from a simulation case, i.e. bootstrapping actors, which are created before the simulation is started.
It is generally based on the class_Actor:create_initial_actor/2 static method:
ActorPid = class_Actor:create_initial_actor( ActorClassName, ActorConstructionParameters )
For example, in my_example_test.erl we could have:
ActorPid = class_Actor:create_initial_actor( class_PinkFlamingo, [ _Age=5, _Gender=male ] )
Should multiple initial actors have to be created, using this method would be less than optimal, as the load balancer would be looked up in the process registry at each call of this static method which, when creating thousands of actors in a row, could induce some overhead.
Therefore a more efficient alternative is available, the class_Actor:create_initial_actor/3 static method, for which the PID of the load-balancer is to be specified as a parameter, having thus to be looked-up only once in the simulation case:
LoadBalancerPid = class_LoadBalancer:get_balancer(),

FirstActorPid = class_Actor:create_initial_actor( Class1,
	Parameters1, LoadBalancerPid ),

SecondActorPid = class_Actor:create_initial_actor( Class2,
	Parameters2, LoadBalancerPid ),

[...]
A typical use case is to load from any source (file, database, etc.) a set of construction parameters for a large number of instances.
For larger cases, creating actors sequentially may lead to very significant simulation start-up durations.
In such cases, class_Actor:create_initial_actor/1 should be used instead: a smart, parallel, batched creation will then be performed, allowing all instances to be created as efficiently as reasonably possible.
This results in a considerably faster creation of the initial state of the simulation, provided there is no dependency between the created actors in the specified batch. Otherwise actors should be created in multiple stages, to ensure that the PID of the prerequisite actors is already known and can be specified at a later stage, when in turn creating the actors whose constructor requires these PIDs.
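For example, with hypothetical classes where each host actor requires the PID of an already-created router, the creations would be split into two stages:

% First stage: actors with no dependency on other actors.
RouterPid = class_Actor:create_initial_actor( class_Router,
	[ _RouterName="main_router" ] ),

% Second stage: actors whose constructors require the previous PIDs.
HostPid = class_Actor:create_initial_actor( class_Host,
	[ _HostName="host_1", RouterPid ] ),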
For non-programmatic, file-based initialisation, we strongly recommend using our rather advanced loading system, as described in the technical guide (see its Sim-Diasca Management of Simulation Inputs section).
All initial operations (i.e. all operations to be triggered before the simulation starts) must be synchronous, to ensure they are indeed finished once the simulation is run: the simulation case has to wait for their completion before greenlighting the start of the simulation.
This involves the use of:
Otherwise there could be a race condition between the end of these initial operations (which may take any time) and the triggering of the simulation start (a message which, without flow control, could be sent too early by the simulation case).
When creating initial actors, we might find it useful to create an actor A that would in turn create other initial actors, and so on (nested creations).
This is possible, however these creations should not be directly done from the constructor of A, as this would lead to a systematic deadlock by design [3]. Some solutions have been identified, but they were not satisfactory enough [4].
[3] | A deadlock will occur because the load balancer will be blocked waiting for the creation of actor A to finish, thus paying no attention to the creations requested in-between, while these nested creations are themselves waiting for the creation of A to complete. |
[4] | A non-blocking solution could be to have a load balancer that does not wait for an instance to acknowledge that its spawn is over: the load balancer would thus return immediately and keep track of the spawn_successful message (interpreted as a oneway) that it should receive before the simulation starts. However in that case no total order in actor creation seems to be possibly guaranteed: actor A could create B and C, which themselves could, after some processing, create other actors. As a consequence B and C would create them concurrently and, depending on various contextual factors, their creation requests could be received by the load balancer in no particular order, leading to a given actor bearing a different AAI from one run to another. Nested creations would thus be obtained at the expense of reproducibility, which is not wanted. |
Instead, the constructor of A should just create A and return, and the actual creations of other actors should be triggered by a subsequent method call (a request, not a oneway, as explained in the Synchronicity section).
For example, in my_creation_test.erl, we could have:
[...]
ActorAPid = class_Actor:create_initial_actor( ClassA,
	ParametersForA ),

ActorAPid ! { createDependingActors, [], self() },
actors_created = test_receive(),
[...]
Note that test_receive/0 corresponds to a safer form than receive {wooper_result, R} -> R end. It is logically equivalent, but immune to interfering messages that could be sent to the simulation case by other Sim-Diasca services (e.g. notifications from the trace supervisor).
Once the simulation is started, an actor can only be created by another one (for example it then cannot be created directly by the simulation case itself), so that a correct simulation time can be enforced.
The creating actor should call the class_Actor:create_actor/3 helper function for that creation, like in:
CreationState = class_Actor:create_actor( Classname,
	ConstructionParameters, State ),
[...]
If called at simulation timestamp {T,D}, then the specified actor will be actually created (by the load-balancer) at {T,D+1}, and at {T,D+2} the creating actor will know (as its onActorCreated/5 method will be called) the PID of the just created actor.
The creating actor - and any other actor that will be given the returned PID - can then freely interact with the created actor (of course thanks to actor messages), exactly as with any other actor (once its creation is performed, there is no difference between an actor created in the course of the simulation and an initial actor).
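A minimal sketch of such a callback could be as follows; the parameter names are assumptions, and the exact return helper may depend on the engine version:

% Called automatically on the creating actor once the actor requested
% thanks to create_actor/3 has been created:
onActorCreated( State, CreatedActorPid, _CreatedClassname,
				_ConstructionParameters, _LoadBalancerPid ) ->

	% Typically records the PID for later inter-actor messages:
	RegisteredState = setAttribute( State, peer_pid, CreatedActorPid ),

	actor:return_state( RegisteredState ).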
Regardless of whether a creation is to happen initially or in the course of the simulation, it is often a lot more efficient to ensure that sets of actors known to be tightly coupled are created on the same computing host (i.e. are co-allocated).
Otherwise these actors would be scattered by the load balancer on multiple computing hosts according to its placement policy, i.e. regardless of their relationship (since the load balancer has no a priori knowledge about the interactions between models), which would lead in the general case to a useless massive network overhead, and thus to simulations that would be considerably slowed down.
Sim-Diasca offers a way of forcing co-allocation (i.e. of ensuring that a set of actors will in all cases be created on the same computing host, no matter which host it is), thanks to placement hints.
A placement hint can be any Erlang term (atoms are generally used for that purpose), that can be specified whenever an actor is created. The engine guarantees that two actors created with the same placement hint will end up being instantiated (by the load balancer) on the same computing host [5].
[5] | Unless a compute node was lost in the course of a simulation that recovered from it. |
So Sim-Diasca provides a counterpart to its basic creation API, whose functions are just expecting one extra parameter, the aforementioned placement hint:
Except for the hint specification, these functions work exactly as their counterparts (e.g. w.r.t. the call to onActorCreated/5).
For example, if devices in a house were to be modelled, and if a large number of houses was to be simulated, then for house 437, the placement hint (as an atom) house_437 could be specified for the house creation, as well as for the creation of each of the devices it will contain.
That way they would be all created and evaluated on the same computing host, exchanging numerous local messages with no need for costly and slow networked messages.
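As a sketch, assuming the hint-enabled counterpart of create_initial_actor/2 is named create_initial_placed_actor/3 (class and parameter names being illustrative as well):

% Both actors specify the house_437 hint, hence will be co-allocated:
HousePid = class_Actor:create_initial_placed_actor( class_House,
	[ _HouseName="house_437" ], _PlacementHint=house_437 ),

HeaterPid = class_Actor:create_initial_placed_actor( class_Heater,
	[ _HeaterName="heater_1", HousePid ], house_437 ),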
Removing an actor from the simulation is a bit more complex than inserting a new one, due to pending inter-actor relationships that may interfere with the actor termination.
An actor A should not decide that another actor B is to be removed immediately from the simulation. Notably, sending a delete message to B would amount to calling the WOOPER destructor directly, therefore bypassing the Sim-Diasca simulation layer and making the simulation freeze or fail on error [6].
[6] | Indeed actor B would then terminate immediately, either causing the time manager to wait for it unsuccessfully (if the tick of B was not finished yet) or possibly causing it to be removed from the simulation while another actor could still send an actor message to it, this sender being then blocked forever, waiting for an acknowledgement that would never come. Moreover the time manager intercepts actor deletions and checks that they were indeed expected. |
Instead, actor A should send an actor message to actor B (if B is not simply to terminate of its own accord), resulting in the corresponding oneway of B being triggered. Then B may or may not choose to terminate, immediately or not. Alternatively B may, by itself, determine that it is time for it to be removed from the simulation.
In any case, B will decide that it terminates, at {T,D}. The main conditions for its deletion are that:
To emphasize more, the model developer should ensure that, once an actor is terminated, no other actor expects to interact with it anymore (i.e. that all other actors should stop sending actor messages to it). The objective is therefore to delay appropriately the triggering of the termination of an actor until all possibilities of outside interactions are extinguished.
The smallest duration for a termination procedure cannot be automatically determined, as the PID of the terminating actor (B) can have been transmitted in the meantime from actors to actors. Therefore it is the duty of the developer to ensure that a terminating actor B is safely unregistered from all the actors that may interact with it in the future (generally a small subset of the ones that know its PID). Often this unregistering procedure is best done directly from the actor B itself. Then only B can safely terminate.
Two options exist for a proper termination procedure:
The first option is by far the simplest and most common: B simply calls class_Actor:declareTermination/1 and, starting from the same diasca, notifies any relevant actor of its termination. The notification chain will unfold over as many diascas as needed, and once all the diascas for the current tick are over, a new tick will be scheduled and B will then be deleted automatically.
The second option is more precise but more demanding, as it requires B to be able to determine an upper-bound to the number of diascas that can elapse before it can safely terminate (thus without waiting for the next tick to happen).
Such a feature is provided so that, during a tick, any number of actor creations, deletions and interactions may happen, "instantaneously", and according to any complex pattern.
For example, B may know that only actor C knows it, in which case B will notify C of its termination immediately, implying that, starting from {T,D+2}, C is expected to never interact with B anymore (C will receive and process the message at {T,D+1}, but, due to message reordering, C might already have sent a message to B at this timestamp; in the general case B should ignore it).
In this context B is to call class_Actor:declareTermination/2, with a termination delay of 2 diascas. A larger delay would have to be specified if C had to notify in turn D, and so on...
With both termination options, once class_Actor:declareTermination/{1,2} is called, the engine will take care of the appropriate waiting and then of the corresponding deletion, with no further intervention.
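The first option could thus boil down to the following sketch (placed, say, at the end of the actSpontaneous/1 oneway of B; the return helper follows the current WOOPER conventions and may vary with the version in use):

% B decides here to terminate; it is expected to have notified the
% relevant actors and withdrawn any planned spontaneous tick before:
TerminatingState = class_Actor:declareTermination( State ),

wooper:return_state( TerminatingState ).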
Note that:
Note
The proper termination of an actor results into a normal termination, not in a crash. Therefore processes (including other actors) that would be linked to a terminating actor will not be terminated in turn because of it.
On the other hand, as soon as an actor crashes, the simulation is expected to fully crash in turn, in order to avoid silent errors; no automatic fallback to a crash can be defined anyway, since a crash generally means that there is a bug in the code of at least one model.
In the code of an actor (i.e. of a class inheriting, directly or not, from class_Actor), the only attribute inherited from Sim-Diasca that should be directly accessed by models is trace_categorization, in order to provide, from the constructor, various ways of selecting trace messages afterwards.
All other attributes inherited from a class_Actor instance should be regarded as strictly private, i.e. as technical details of the engine that are not of interest for the model developer (neither in terms of reading nor of writing).
Of course the developer is free to define any class hierarchy, with each specialising class defining all the (non-colliding) attributes needed.
| Action | Correct | Incorrect |
|---|---|---|
| Initial Actor Creation (before the simulation start) | class_Actor:create_initial_actor/2 (directly from the simulation case) | Use of a variation of spawn or of WOOPER new |
| Runtime Actor Creation (in the course of the simulation) | class_Actor:create_actor/3 (only from another actor) | Use of a variation of spawn or of WOOPER new |
| Actor Communication | class_Actor:send_actor_message/3 | TargetActor ! AMessage |
| Actor Termination Decision | Notify relevant actors and postpone termination until the longest possible interaction is necessarily over | Immediate, non-coordinated termination |
| Actor Termination Execution | class_Actor:declareTermination/{1,2} | TargetActor ! delete |
Simulation time is fully decorrelated from wall-clock time, and is controlled by the time manager(s): the fundamental frequency of the simulation (e.g. 50Hz) leads to a unit time-step (a.k.a. simulation tick) being defined (e.g. 20ms, in simulation time), each time-step lasting, in wall-clock time, for whatever duration is needed so that all relevant actors can be evaluated for that tick.
If that wall-clock duration is smaller than the time-step (the simulation is "faster than the clock"), then the simulation can be interactive (i.e. it can be slowed down on purpose to stay on par with wall-clock time, allowing for example for some human interaction), otherwise it will be in batch mode (running as fast as possible).
A simulation tick is split into any number of logical moments, named diascas, which are used to solve causality and are not associated to any specific duration by themselves.
Both ticks and diascas are positive unbounded integers.
So a typical simulation timestamp is a tick/diasca pair, typically noted as {T,D}.
Controlling this simulation time means offering a scheduling service, here in a distributed way: it relies on a tree of time managers, each being in charge of a set of direct child managers and of local actors.
This scheduling service drives them time-wise, so that they all share the same notion of time (ticks and diascas alike), find a consensus on its flow, while still being able to evaluate all corresponding actors in parallel, in spite of their possible coupling.
In the most general terms, the behaviour of an actor is partly determined by what it would do by itself (its "spontaneous behaviour"), and partly by the signals its environment sends to it, i.e. by the messages that this actor receives from other actors (its "triggered behaviour").
In both cases, for an actor, developing its behaviour boils down to updating its state and/or sending messages to other actors, and possibly planning future spontaneous actions and/or sending information to probe(s).
Each actor, when created, first has its onFirstDiasca/2 actor oneway triggered [7]. This is the opportunity for this newly created actor to develop any immediate first behaviour, and also to specify at once when it is to be scheduled next for a spontaneous behaviour: otherwise, as all actors are created with an empty agenda, they would remain fully passive (never being spontaneously scheduled), at least until a first actor message (if any) is sent to them.
[7] | This actor actually receives the corresponding actor message sent by the load balancer, which determined a placement for it and created it. |
So all models are expected to define their onFirstDiasca/2 actor oneway [8], in which most of them will at least program their next spontaneous schedulings (see, in class_Actor, notably addSpontaneousTick/2 and addSpontaneousTicks/2). This corresponds, internally, to exchanges with the time managers in charge of the corresponding actors.
[8] | Knowing that the default implementation for onFirstDiasca/2, inherited from class_Actor, simply halts the simulation on error, purposely. |
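A typical implementation could resemble the following sketch; get_current_tick_offset/1 is assumed here to return the current tick offset, and the exact call style and return helper may depend on the engine version:

% Called automatically when this actor enters the simulation (the
% sending actor being the load balancer):
onFirstDiasca( State, _SendingActorPid ) ->

	% Requests a first spontaneous scheduling at the very next tick:
	NextTick = class_Actor:get_current_tick_offset( State ) + 1,
	PlannedState = class_Actor:addSpontaneousTick( State, NextTick ),

	actor:return_state( PlannedState ).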
Creations happen at the diasca level rather than at the tick level, so that any sequence of model-related operations (creation, deletion, action and interactions) can happen immediately (in virtual time), to avoid any time bias.
Then a very basic procedure will rule the life of each actor:
[9] | Actually there are other reasons for a diasca to be created, like the termination of an actor, but they are transparent for the model developer. |
Internally, these scheduling procedures are driven by message exchanges by actors and time managers:
The basic granularity in virtual time is the tick, further split on as many diascas as needed (logical moments).
The engine is able to automatically:
Indeed the simulation engine keeps track both of the sendings of actor messages [10] and of the planned future actions for each actor. It can thus determine, once a diasca is over, if all next diascas or even a number of ticks can be safely skipped, and then simply schedule the first next timestamp to come.
[10] | This is done in a fully distributed way (i.e. through the scheduling tree of time managers over the computing nodes), and all communications between an actor and its time manager are purely local (i.e. they are by design on the same Erlang node). Moreover the messages themselves only go from the emitting actor to the recipient one: in each diasca, only the fact that the target actor received a first message is of interest, and this is reported only to its own, local time manager; the actual message is never sent to third parties (like a time manager), and no more notifications are sent by the receiving actor once the first message has been reported. So the number of messages, their payload and the communication distances are reduced to a bare minimum. |
So, for any simulation tick, each actor may or may not be scheduled, and an actor will be scheduled iff:
Each actor is itself able to keep track of its expected schedulings, and can thus automatically check that they indeed match exactly the ones driven by the time manager, for increased safety.
Anyway these mechanisms are transparent to the model developer, who just has to know that all actor messages, once appropriately reordered, will be triggered on their target, and that the planned spontaneous schedulings will be enforced by the engine, according to the requests of each actor.
Thus the developer just has to define the various actor oneways that the model should support (i.e. the ones that other actors could trigger thanks to an actor message), and the spontaneous behaviour of that model (i.e. its actSpontaneous/1 oneway). Then the simulation engine takes care of the rest.
Each actor is able to specify, while being scheduled for any reason (an actor message having been received, and/or a spontaneous action taking place), at least one additional tick at which it should be spontaneously scheduled later. An actor can be scheduled for a spontaneous action up to once per tick.
To do so, it can rely on a very simple API, defined in class_Actor:
An actor may also decide instead to terminate, using declareTermination/{1,2} for that, once having withdrawn any spontaneous ticks that it had already planned [11].
[11] | The time management service could be able to determine by itself which ticks shall be withdrawn whenever an actor departs, however this operation would not be scalable at all (it would become prohibitively expensive as soon as there are many actors and/or many ticks planned for future actions). |
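Putting it together, the spontaneous behaviour of a model could look like the following sketch (update_state/1 being a purely illustrative helper, and the call style being the same assumption as in the previous sketch):

% Spontaneous behaviour of this example model:
actSpontaneous( State ) ->

	% Domain-specific logic would go here...
	UpdatedState = update_state( State ),

	% ...then this actor plans to act again 10 ticks later:
	NextTick = class_Actor:get_current_tick_offset( UpdatedState ) + 10,
	PlannedState = class_Actor:addSpontaneousTick( UpdatedState,
		NextTick ),

	wooper:return_state( PlannedState ).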
In a distributed context, on each computing host, the current working directory of the simulation is set automatically to a temporary root directory, which will be appropriately cleaned-up and re-created.
This root directory is in /tmp, to store all live data, deployed for the simulation or produced by it.
Its name starts with sim-diasca (to prevent clashes with other applications), then continues with the name of the simulation case (so that multiple cases can run in the same context), then finishes with the user name (so that multiple users can run the same cases on the same hosts with no interference).
Thus the root directory of a simulation on any host is named like:
/tmp/sim-diasca-<name of the simulation case>-<user name>
For example:
/tmp/sim-diasca-Sim-Diasca_Soda_Integration_Test-boudevil
This root directory has two sub-directories:
A simulator which added third-party data to the simulation archive (thanks to the additional_elements_to_deploy field of the deployment settings specified in the simulation case) is able to access them thanks to class_Actor:get_deployed_root_directory/1.
For example, if the following was specified:
DeploymentSettings = #deployment_settings{

	...

	additional_elements_to_deploy = [
		{ "mock-simulators/soda-test", code },
		{ "mock-simulators/soda-test/src/soda_test.dat", data }
	...

},
...
Then all models are able to access the data file thanks to:
DataPath = file_utils:join(
	class_Actor:get_deployed_root_directory( State ),
	"mock-simulators/soda-test/src/soda_test.dat" ),

% Then open, read, parse, etc. at will.
On simulation success, all results will be appropriately generated (in a rather optimal, parallel, distributed way), then aggregated and sent over the network to the centralised result directory, created in the directory from which the simulation was launched, on the user host.
Finally, on simulation shutdown, the deployment base directory will be fully removed.
In this section, various technical pieces of information (not of interest for users, but relevant for engine maintainers) will be discussed regarding the mode of operation of Sim-Diasca.
By default (unless the case specifies otherwise), only the local host is involved, yet there are two VMs (Erlang virtual machines) running then: the one of the user node, and the one of a (local) computing node.
In the general case, distributed simulations running on N hosts will involve by default N+1 nodes: one user node (on the user host) and N computing nodes (including one on the user host).
See the computing_hosts field in the deployment_settings record (defined in class_DeploymentManager.hrl) for further options.
By default, long names are used for all Sim-Diasca related nodes.
To avoid any possible cross-talk, we have to ensure that a simulation (live or post-mortem) remains fully self-contained and immune to interferences (notably from other simulations that would be run in parallel or afterwards, i.e. both at runtime and regarding the on-disk generated information).
Erlang ensures (thanks to EPMD) that, on any given host, regardless of the Erlang installations, of their version, of the UNIX users involved, no two nodes can bear the same (long) name (otherwise the second node will fail to launch).
The Sim-Diasca user node is launched from the generic makefile infrastructure, resulting in the myriad/src/scripts/launch-erl.sh helper script being run with proper parameters.
The name of such a user node (made a distributed node programmatically, see Node Naming) will follow the following format: Sim-Diasca-<name of the test or case>-<user name>-<simulation instance identifier>-user-node.
For example, a case named soda_deterministic_case.erl run by a user john and relying on a Simulation Instance Identifier equal to 43416933 will result in a user node named Sim-Diasca-soda_deterministic_case-john-43416933-user-node.
As for the (per-host) computing nodes, they will be launched each from their respective class_ComputingHostManager instance (driven by the class_DeploymentManager singleton created at start-up), and their name will follow that format:
Sim-Diasca-<name of the test or case>-<user name>-<simulation instance identifier>-computing-node-on-<hostname>
The same example, running on host volt, will thus result in a computing node to be created under the name:
Sim-Diasca-soda_deterministic_case-john-43416933-computing-node-on-volt
This information, whose shorthand is SII, is a string made of alphanumerical characters, dashes (-) and underscores (_), and is meant to guarantee the uniqueness of a given instance of a simulation (i.e. no two simulations, run simultaneously or not, should ever bear the same SII).
By default the SII is automatically generated and managed by Sim-Diasca. It is based on a UUID (Universally unique identifier) determined at start-up [12]. An example of a UUID is 4f8fbacd-93d2-487d-86ff-23f75339c191.
[12] | A UUID is obtained thanks to any system-provided uuidgen command, otherwise our own implementation is used for that (refer, in the Myriad layer, to basic_utils:generate_uuid/0). |
As the acronym suggests, there is very little chance that two simulation instances could draw the same UUID, so UUIDs provide an adequate guarantee of uniqueness.
These UUIDs may be deemed a bit too long to be tractable for humans, so the engine shortens them through hashing (thanks to erlang:phash2/1), hopefully preserving a sufficient part of their underlying uniqueness.
The hash value of our example UUID corresponds to the 43416933 identifier in the previous section.
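In practice, the shortening boils down to something like the following sketch (the exact transformation applied by the engine may differ):

% erlang:phash2/1 maps any term onto, by default, [0..2^27-1]:
Uuid = "4f8fbacd-93d2-487d-86ff-23f75339c191",
Sii = erlang:phash2( Uuid ),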
While such an automatic identification is convenient and transparent, for some uses it is possible and even desirable not to rely on such a randomly determined identifier, but to use instead one that is transmitted by a third party (typically if the engine is embedded in a simulation platform able to provide its own identifiers).
Then Sim-Diasca is able to use such an externally-obtained identifier thanks to its --simulation-instance-id command line option.
As a result, such a platform may run a simulation case with:
$ make my_foobar_case CMD_LINE_OPT="--batch --simulation-instance-id 117"
Then the specified SII will be used instead of the one that would be determined internally, at runtime, notably to designate:
So the SII is either user-supplied or determined at runtime, by the engine itself. As a result, the name of the user node cannot be determined statically in the general case (the node must run so that it may draw its UUID then determine its SII).
Knowing additionally that a node created as a distributed one (here with the "long names" command-line option) cannot be renamed (net_kernel:stop/0 is not allowed then), the only relevant design is, from the Sim-Diasca layer onward (lower layers relying on long names), to start the user node in non-distributed mode (thanks to the --nn option of launch-erl.sh), establish the name it shall bear, and only then execute net_kernel:start/1.
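This late-naming procedure corresponds roughly to the following sketch (variable names being illustrative):

% The node was launched in non-distributed mode; now that the SII is
% known, the full node name is established and only then applied:
NodeName = list_to_atom( "Sim-Diasca-" ++ CaseName ++ "-" ++ UserName
	++ "-" ++ Sii ++ "-user-node" ),

{ ok, _NetKernelPid } = net_kernel:start( [ NodeName, longnames ] ),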
The naming of nodes is a first-level security, which should prevent most accidental collisions.
If ever all other safety measures failed for any reason, a node naming clash will happen, yet it will be detected and will lead to making the clashing simulations fail, so no silent failure shall be feared.
This protection is obtained thanks to Erlang cookies using transparently the UUID mentioned in the previous section (UUIDs will be used in all cases for cookies, even if a third-party SII is specified - for an increased safety, should clashing SIIs be provided by mistake).
We mention here the main technical sources of puzzlement that may affect the unwary developer.
This may happen if the source code has been changed yet has not been recompiled before launching a simulation: Sim-Diasca, once executed, will attempt to compile it, and hopefully succeed.
Then the corresponding BEAM modules will be available in their newer version and, when they are referenced for the first time, they will be loaded - thus in their newer form.
However, some modules may have already been loaded by the engine (for its internal mode of operation), before it triggered the compilation update.
As a result, even if a newer version of their BEAM file becomes available on disk, these modules have already been loaded (and will not be specifically reloaded) [13]; they will thus stick to their former version, and their newer version will be loaded only at the next Sim-Diasca run.
[13] | Moreover, they may belong to the pioneer modules, in which case they will be deployed over the network on other nodes as well, instead of being read in an updated version from the simulation archive. |
A solution is simply to ensure that any module whose source has been modified is recompiled afterwards (simply a matter of typing CTRL-P with our emacs settings), at least before a new simulation is run.
Writing models involves a lot of algorithmic design decisions, and many of them deal with data-structures.
There is a large choice of both data-structures as such (lists, trees, associative tables, heap, etc.) and of implementations (gb_tree, hashtables, etc.), offering various trade-offs.
On some occasions we felt the need to develop our own versions of some of them (see myriad/src/data-management for the most common ones), even if in some cases built-in solutions could provide better trade-offs (thinking of ETS tables, albeit they offer a different sharing semantics, and of the process dictionary, which is not the purest and most flexible feature we wanted to rely on).
We nevertheless use the built-in data-structures most of the time, like gb_sets or queues. When multiple implementations providing the same API are available (e.g. for ordered lists), we usually define a (sometimes module-specific) list_impl symbol, allowing to switch easily between similar data-structures.
For example:
% Defines list_impl:
-include("data_types.hrl").

f( A ) ->
	true = ?list_impl:is_empty( A ),
	[...]
Similarly, for algorithms operating on these data-structures we tend not to reinvent the wheel (e.g. class_Mesh uses the digraph module), unless we need specific versions of them (e.g. operating on an implicit graph, with user-specified anonymous functions, see myriad/src/utils/graph_utils.erl).
Making use of large-scale HPC infrastructures is not straightforward: they often behave like black boxes, and because of the number and specificities of the hardware and software elements that are involved, they tend to exhibit unanticipated behaviours.
Here is a list of recommended steps to go through, in order to have a better chance of making good use of these resources:
deactivate the sending of simulation traces, which would otherwise overwhelm the trace aggregator: comment-out, in traces/src/class_TraceEmitter.hrl:

%-define(TracingActivated,)
activate any console outputs you are interested in (uncomment relevant io:format calls):
increase the time-outs:
copy the source of the simulator and update the configuration files according to the case to be run
compile everything from the root, from scratch (make clean all)
for all the classes for which traces are wanted (if any), re-enable their sending:
re-activate traces, reverting the content of traces/src/class_TraceEmitter.hrl
update the time-stamps of all target classes, e.g.:
touch sim-diasca/src/core/src/scheduling/class_DeploymentManager.erl
re-compile from the root (make all) if traces were enabled for at least one class
possibly: hide ~/.ssh/known_hosts to avoid nodes being rejected because of a change in the RSA fingerprint of their key
launch in debug mode from the front-end, e.g.
sim-diasca/conf/clusters/sim-diasca-launcher.sh --debug --node-count 32 --cores-per-node 8 --queue parall_256 --max-duration 64 foobar-simulator/src/uc23_integration_test.erl
Adding a type specification (shorthand: spec) to the source code of a function means specifying what are its intended input and output parameters, in terms of number and types. This can be applied to records as well.
Once all the Sim-Diasca code (including prerequisites like Myriad, WOOPER and Traces) and that of the user applications based on it (e.g. Mock-Simulators or any actual simulator) have been instrumented with type specifications (i.e. when all exported functions and records of all modules have a proper spec), then:
Some versions of Sim-Diasca are referenced in Dialyzer's application repository.
First of all, type information must have already been extracted from the Erlang/OTP files of the install that will be used, and stored in a PLT (Persistent Lookup Table) file for later reuse (this is a preprocessing stage). Such a file is preferably created once for all, for each Erlang environment being used; therefore a PLT file is best produced as the last step of an Erlang installation (see the --generate-plt option of our install-erlang.sh script, which streamlines it). This operation is rather long (e.g. one hour and a half).
We prefer to have Dialyzer operate on BEAM files (*.beam) rather than on source files (*.hrl/*.erl), as include paths, symbol definitions and parse transforms are better supported this way.
These BEAM files must have been compiled with debug information (i.e. with the +debug_info compiler option). This is thus the default enforced for the full Sim-Diasca software stack.
When writing type specifications, one must know what are the built-in ones, in order to re-use them, so that they do not end up being defined more than once, under different names. To do so, one may use our myriad/src/scripts/list-available-types.sh script, like in:
$ cd otp_src_RxBy
$ list-available-types.sh | tee declared-types-in-Erlang-RxBy.txt
Once a PLT is available for Erlang/OTP, PLTs are to be generated for the entire codebase of interest (typically Sim-Diasca and its prerequisites, and possibly user code as well).
This can be achieved with the generate-all-plt make target, to be run from the root of either a check-out or an install. The script will climb our software stack layer by layer, and generate for each a custom PLT (e.g. myriad.plt).
If, for any reason, the PLT of a layer must be (re)generated, simply use the generate-local-plt make target from the root of this layer.
The generation of a PLT will notably allow catching discrepancies between calls and the specs of the functions being called.
Checking specs against the functions they apply to is useful as well. This can be done on any layer through the self-check-against-plt make target.
Like for Erlang, for each layer a repository of the type declarations defined there can be built, either by running make generate-list-of-local-types from the root of that layer (producing then a declared-types-in-<LAYER>.txt file), or by running make generate-list-of-all-types from the Sim-Diasca root to have all lists of types generated at once.
The complete syntax is described in the Erlang documentation (see its Types and Function Specifications section). This will be the main reference to be used, and kept ready, when writing type specs.
The following conventions are to be respected:
For a larger codebase to instrument with type specs, it may be useful to start first with the specs that can be deduced by Dialyzer from the actual code of functions. This can be done thanks to our add-deduced-type-specs.escript script (in myriad/src/scripts). One should note that these specs are not, in the general case, the ones that the developer would have written (as Dialyzer cannot guess the intent of the original developer), so at least some adaptation work remains (e.g. to define reusable types).
Note
A developer may overspec or underspec.
Overspecification corresponds to the writing of type specifications that are narrower than the allowed types that a function could process. For example, even if a given function happened to be able to use improper lists as well, the developer may decide that only proper lists are to be passed. Similarly, one may prefer string() to [[any()] | char()].
Reciprocally, underspecification corresponds to the writing of type specifications that are larger than the allowed types that a function could process. This may happen if planning to expand later the inputs that a function can take into account.
Overspecification is perfectly legitimate, whereas underspecification should preferably be avoided.
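For example, the following spec overspecifies on purpose: the function body would accept any list of numbers, yet only proper lists of integers are advertised, for clarity of intent:

% Overspecification, legitimate and chosen deliberately:
-spec sum( [ integer() ] ) -> integer().
sum( Values ) ->
	lists:sum( Values ).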
In the context of each layer, one may routinely run:
$ make clean all generate-local-plt
This allows listing all the types that are unknown (generally misspelled or not exported), and spotting a few kinds of errors (e.g. Call to missing or unexported function).
For a layer foo (e.g. Myriad, WOOPER, etc.), one should run from its root directory:
$ make self-check-against-plt
You will have an output like:
$ make self-check-against-plt
Building all, in parallel over 8 core(s), from BASE/foo
[..]
Checking foo against its PLT (./foo.plt)
Checking whether the PLT ./foo.plt is up-to-date... yes
Compiling some key modules to native code... done in 0m29.49s
Proceeding with analysis...
bar.erl:53: Function run/0 has no local return
[..]
Issues can then be tackled one by one. To speed up the process of improving a module bar, one can run:
$ make bar.plt
Checking module 'bar.beam' against relevant PLT
[...]
Then only this module will be checked, allowing its issues to be fixed incrementally.
Note
When a source file is modified, the rebuild of its BEAM file must be triggered explicitly, otherwise Dialyzer will not detect that its PLT is not up-to-date anymore (it relies on the timestamp of the BEAM file, not on the one of the *.erl file).
Special thanks to Randall Munroe who is the author of all the comic strips that enliven this documentation, and who kindly allowed their use in this material.
See his XKCD website for all information, including for all his delightful other strips.
Congratulations, you reached the end of this developer guide!
Now you should write your own test models, getting inspiration from the Sim-Diasca Mock Simulators (see the top-level mock-simulators directory in the source archive).
Writing such toy models is surely the shortest path to understanding the Sim-Diasca conventions, no matter how fruitless an activity it may seem:
We hope that you will enjoy using Sim-Diasca. As always, any (constructive!) feedback is welcome (use the email address at the top of this document for that). Thanks!