I am a US citizen, but have been living in Germany since 1989.
MY CORE COMPETENCE
– Full project life cycle experience as a developer, SW architect, team leader and manager
– Introduction of evolutionary improvements in tooling and processes to accelerate development workflow and improve quality
– Coordinate international, multi-company software development, building a strong partnership to produce collaborative results
– Visionary and team builder, skilled at achieving buy-in from stakeholders
– intacs® certified Provisional ASPICE Assessor
– Diplom Mathematik, 1981, Georgia Augusta Universität, Göttingen, Germany; Thesis title, „An Application of Fourier Analysis in Visual Pattern Recognition“. Scholarship from the Max Planck Institute for Biophysical Chemistry, Göttingen, Germany
– BS Mathematics, 1976, U.C. Irvine California; Internal Publication of the Physical Sciences Department, “An APL Workspace for Plotting Surfaces”
– Functional Safety Practitioner
Jan, 2017 – present
In 2016 it became clear that the position at Visteon had nothing more to offer, so I started to cast about for alternatives. I quickly decided that I no longer wanted to be an employee and became an independent consultant.
EVOLUTION OF EXISTING DEVELOPMENT PROCESSES
In the past two years I have been involved in the task of evaluating development processes both to provide support with the rollout and to identify improvements to achieve better ASPICE compliance. The task of process definition and improvement is an important and essential starting point, but I have seen that it is only the starting point.
Engineers often perceive changes in the development process (and indirectly, the quality dept) as vampires consuming their valuable working time. It is easy to find engineers who are dissatisfied and can provide a long list of (in their eyes) senseless, time-consuming process demands that hinder them in their core task of development. These engineers rarely propose solutions that are more than a local optimum for individual developers or teams. More general solutions will often encounter problems with acceptance during rollout.
It is essential to define an evolutionary path that will allow the developers to transition from their current way of working to a new way of working. The first milestone along the way must be to achieve acceptance from the engineers. If the engineers don’t accept these changes, if they perceive the changes as a problem, they will find a workaround. This doesn’t help the organization as a whole.
It is possible to identify key activities in the development flow that can be improved in isolation. Once this is achieved, then increasingly stringent demands can be imposed on the inputs. One such key activity is the integration. If a solid integration strategy can be developed, this provides a starting point to demand better quality from the inputs to integration. Better architecture makes incremental improvement of the integration strategy possible. Better testing from the component and subsystem developers results in earlier detection of flaws and faster integrations. Furthermore, a clean integration strategy allows a tighter integration with testing to ensure stronger testing during and after the integration.
In summary: Don’t simply change the development process, evolve it. When larger changes are necessary, deal with one focus area after another so as to keep the changes manageable. And don’t just define the process changes, define an introduction strategy as well.
EVALUATION OF ENGINEERING COMPETENCE
I was fortunate to be given a fascinating task in early summer of 2018. Company A (non-German) wanted to purchase Company B (German). As part of their due-diligence process, I was asked to perform an evaluation of the engineering competence within Company B.
Although this was by no means an ASPICE assessment, the ASPICE model nevertheless provided valuable orientation for areas of focus and allowed me to structure the list of questions that I wished to address. Of course, as the saying goes, no strategy survives the first encounter. I didn’t work through my list of questions in quite the structured fashion I had intended. But I was very impressed with some of the work being done in Company B and made a positive recommendation. They were taken over and according to my last reports are quite successful together. I wish them all the best for the future.
REVIEW AND ANALYSIS OF INFOTAINMENT SOFTWARE ARCHITECTURE
A European OEM contracted with me to review the software architecture of an infotainment system being built by a German Tier1. They wished to have answers to two questions:
1. Does the planned architecture have any potential gaps or problems of which the OEM needs to be aware?
2. Are the customer requirements fulfilled by the architecture?
Several interesting and disturbing observations caught my attention during this contract. One observation was that much of the documentation was not available because it had not yet been generated. The explanation from the Tier1 was that their platform and their developers are so good that they don’t need to do a separate software architecture, they simply re-use and adapt their platform. In fact, the software architecture documentation is generally only done when a customer insists. A second observation was the insistence of this Tier1 that their development process is “fully” ASPICE compliant. Needless to say, one key output of ASPICE, the bi-directional traceability of customer requirements through the system level to software level was not demonstrable.
After two months of pushing for documentation and reviewing everything that was available, this contract ended when the Tier1 reclassified all their documentation as “top secret”, with access restricted to employees of the OEM.
All in all, a very interesting and educational experience that illustrates the vast differences between belief and reality with the example of ASPICE compliance.
PATTERNS OF SOFTWARE RE-USE
One of the recurring themes is the question of software re-use. It is pursued as a holy grail of software development. Unfortunately, effort is rarely invested in making re-use efficient. Several dominant patterns of software development (and re-use) can be observed.
Pattern 1: Every man for himself.
In this pattern, every project is treated as a one-off. Unique development, without any need to be compliant with legacy software, is lots of fun and extremely creative. The engineers will love it. However, the project budget implicitly imposes a limit on the level of complexity that can be achieved. That is why this pattern is rarely seen in large projects.
Pattern 2: Clone and Own
In this pattern every new project starts off by identifying an existing project which is as close as possible to the goals of the new project. The new project will then clone all the software from the existing project, assume ownership of the copy and start adapting the code to their needs. One problem that results is the loss of synergy and coherence across project development lines. Projects benefit from each other only in their start, afterwards they are competitors fighting for access to developer time.
Pattern 3: Pool-based development
In the “pools” pattern groups exist that are responsible for a specific technology or subsystem across all projects. Such a group will create an internal platform which will be tailored and extended as needed to satisfy the needs of their client projects. Such pools are often reliability and stability leaders in projects because of their approach.
Pattern 4: Platform-based Development
Platform-based development is a natural extension of the pool concept. One challenge is to identify and distill the common elements across multiple projects into the platform. A second challenge is to provide the configuration points that are needed to adapt to individual customer wishes when needed. A third challenge is to organize all of the independent pools into a single independent platform project. To be successful, a platform must be structured as a project which is funded and managed independently of the customer projects. The platform project must have a clearly defined and well managed supplier relationship with the customer projects which it supports.
Pattern 1 is most often seen in small engineering support companies. Large organizations tend to favor pattern 2 (clone and own) with pattern 3 occasionally appearing. Pattern 4 is the most powerful pattern but is the most challenging. The company budget structure will typically show the platform project as a cost center. It is rare for upper management to look below the surface and to recognize the benefits and savings this offers for the customer projects.
nee Johnson Controls Automotive Electronics: Dec, 2012 – Dec, 2016
In Dec, 2012, I started work at Johnson Controls Automotive Electronics. In March 2013 Johnson Controls announced that they wished to sell their Automotive Electronics division. The sale was finalized and Visteon assumed ownership in Jul, 2014.
WHO NEEDS A LINUX STRATEGY?
There is a pink elephant in the room that very few people are talking about. The pink elephant is the question of a Linux strategy for automotive. More specifically, I am thinking about the question of long-term support for devices in the automobile that use the Linux OS.
A vast amount of software development for automotive is Linux based. Many of the companies that are working on autonomous driving are newcomers and have no inhibitions about breaking traditions. Doing their development on Linux is an obvious decision that saves an enormous amount of extra work and costs.
Many of the OEMs are also looking towards Linux. The Genivi Initiative was originally started by the OEMs in an effort to commoditize software components (e.g. for infotainment) by providing a standardized environment.
When SoCs are purchased, almost every supplier will provide a Linux distribution that runs immediately on their reference boards and can be easily adapted for proprietary boards that use this SoC.
The advantages to using Linux would seem to be obvious. Why is this a problem?
Linux can become a source of security problems. Many people that are concerned with security are actually strong advocates of Linux because it offers an array of mechanisms that support security (when correctly configured). The problem is the tempo of Linux development when compared with the tempo of automotive development. A new Linux release is generated roughly every three months. Every Linux release contains some security-related improvements. This means that older Linux releases must be viewed as being increasingly vulnerable as time goes on. To remain secure, the underlying Linux distribution must be updated regularly.
Let’s compare this with the tempo of automotive development. A typical infotainment system will have a development cycle of 2 to 3 years. At some point in that cycle (no later than 6 months before the start of production) the OS version will be finalized. When the device goes into production it will be produced and installed in new cars for anywhere from 4 to 6 years. Once the vehicles have been sold, they will remain in use for an average of 8 years. This means that …
- … the very first systems produced will have a Linux which is at least 2 releases old
- … the last systems produced will have a Linux which is 18 – 26 releases old
- … vehicles out in the field will have a Linux which can be 58 releases old or more
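The arithmetic behind these three figures can be sketched as follows. It is a minimal sketch that assumes exactly four Linux releases per year (one every three months) and an OS freeze exactly 6 months before start of production; the constant and function names are illustrative only.

```python
# Assumptions taken from the text: one Linux release every three months,
# OS version frozen 6 months before start of production (SOP).
RELEASES_PER_YEAR = 4
FREEZE_BEFORE_SOP_YEARS = 0.5


def releases_behind(years_since_freeze: float) -> int:
    """Number of Linux releases published since the OS version was frozen."""
    return round(years_since_freeze * RELEASES_PER_YEAR)


# First systems produced: only the freeze period has elapsed.
first_produced = releases_behind(FREEZE_BEFORE_SOP_YEARS)

# Last systems produced: freeze period plus 4 to 6 years of production.
last_produced_min = releases_behind(FREEZE_BEFORE_SOP_YEARS + 4)
last_produced_max = releases_behind(FREEZE_BEFORE_SOP_YEARS + 6)

# Vehicles in the field: freeze + longest production run + 8 years of use.
in_field = releases_behind(FREEZE_BEFORE_SOP_YEARS + 6 + 8)

print(first_produced, last_produced_min, last_produced_max, in_field)
```

With these assumptions the four values come out as 2, 18, 26 and 58, matching the figures in the list.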
Does an update strategy exist which can ameliorate this problem?
When you speak with SoC suppliers, their strategy is based on LTS Linux distributions which have a support lifetime of 2 to 3 years. Furthermore, when the next generation chip comes out, the Linux distribution for the older generation chips will no longer be supported.
When you speak with OEMs they are willing to acknowledge the problem. But an update of the OS will result in a need to perform extensive qualification testing on the system in the vehicle which is very expensive. Furthermore, a better solution is needed for the logistical problem of updating systems in the field. Over-the-air update (which works fine for mobile phones) might offer a solution here. The costs of qualification testing remain.
The Tier1 suppliers are caught in the middle and will need to provide essential parts of the solution. The suppliers must have a strategy for support of their products (their vehicle component) which can be extended to include an option for regular (e.g. annual or bi-annual) updates of Linux over a period of 12 to 20 years. Each update must bring the installed Linux distribution up to a level which is close to the Linux head revision.
When I was with Visteon, we addressed this problem by taking ownership of our own Linux distribution. This was possible with a surprisingly small team of good engineers. One result was that we were far ahead of the Linux distribution provided by the SoC supplier. Another result was that we could realistically offer long term support including Linux updates for our products.
Unfortunately, like many other good ideas, this was lost when the department was shut down.
FRAMEWORK VS. PLATFORM
Something that has always irritated me is the nearly interchangeable way in which the two concepts of “framework” and “platform” are used. It appears that very few people have a clear idea of what a framework is, what a platform is and what the differences are. Let me take a stab at providing an easy definition.
Think of a box of Lego™ blocks that a child has in a playroom. That box of blocks is the platform. Every individual block is a component of the platform. By assembling the blocks in different patterns, a vast range of wonderful things can be created.
So where is the framework in this description? The answer is that it is hidden in plain sight. The framework is the connector technology (the knobs on top of the blocks) that allow them to be stuck together so easily and flexibly. It’s possible to build things without those knobs (e.g. with Jenga™ blocks) but it won’t hold together as well and won’t be as stable.
Thus, in an architecture diagram, the framework may not be visible. It will be implicit in the way that the components and their interfaces are constructed. It may appear in the tooling that is used (e.g. graphical editors and code generators). The bottom line is that the framework is an enabler for a more powerful platform. By providing standardization in a key area, it makes platform creation and maintenance vastly easier.
Frameworks are incredibly sexy! Much more so than platforms.
HEAD OF INFOTAINMENT SOFTWARE PLATFORM
During my time as the head of the infotainment software platform within JCAE/Visteon I was privileged to work together with an extremely innovative team and was witness to creative interaction at its best.
Heiko Oehring (now at Protos) led the group that was responsible for the Coma application framework. This application framework was the foundation of our platform development within Visteon. It was accompanied by a vast array of tools that automated and simplified the daily tasks of the developers.
Waheed Ahmed led the hypervisor/OS-BSP group. He was one of the lead visionaries in the development of the Visteon hypervisor which was the foundation of the Visteon Smart-Core™ technology. I confess that I was initially an opponent. My opinion was that it is foolish for a company such as Visteon to develop their own hypervisor; this is clearly something that should be purchased from an experienced supplier. I was wrong. The Type0 hypervisor that was created was truly new and was a tremendous simplification for functional safety.
Codethink is a consulting company in the UK with the motto “We provide genius”. I can confirm from my experience that their consultants are worth every penny that you pay, they truly are geniuses. Their special area of focus is open-source software and they were critically helpful in our hypervisor development, the development of our Linux distribution and the development of tooling to maintain our Linux distribution.
Jul, 1999 – Nov, 2012
IT WAS THE ALPHA-PARTICLES!
We had field returns of infotainment systems that no longer started. Each system failed to start in a unique fashion, but they all failed somewhere during the boot process. After re-installing the boot code in the NOR-flash storage device, they all worked perfectly. When we read the binary image out of the NOR-flash and compared it with the image that should have been there, we were always able to find a single bit somewhere that was changed, and the change was always from zero to one (bit erased).
We spent quite some time searching for possible causes in both software and hardware. However, we eventually had the insight that this phenomenon couldn’t have an accidental cause external to the storage device. Individual bits could be programmed (changed from one to zero), but individual bits couldn’t be erased (changed from zero to one). Erase was only supported for entire segments of 128 kbyte (on this device). To erase a single bit from outside the device, one must read an entire segment, change the targeted bit in the copy, erase the segment and reprogram it with the modified copy. This can’t happen accidentally; it could only be the result of malicious intent. Our conclusion was that it must be some mechanism internal to the NOR-flash device.
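The asymmetry between program and erase can be illustrated with a toy model. This is a sketch only: the class, the function names and the tiny segment size are hypothetical and do not represent the actual device or our analysis tooling.

```python
SEGMENT_SIZE = 4  # bytes per erase segment in this toy model (the real device used 128 kbyte)


class NorFlash:
    """Toy NOR-flash model: program clears bits, erase works only on whole segments."""

    def __init__(self, size: int) -> None:
        self.cells = bytearray(b"\xff" * size)  # erased state: every bit is one

    def program(self, addr: int, value: int) -> None:
        # Programming can only change bits from one to zero, never back.
        self.cells[addr] &= value

    def erase_segment(self, index: int) -> None:
        # Erase sets every bit of an entire segment back to one.
        start = index * SEGMENT_SIZE
        self.cells[start:start + SEGMENT_SIZE] = b"\xff" * SEGMENT_SIZE


def flipped_bits(expected: bytes, actual: bytes):
    """Return (byte_offset, bit_number, expected_bit, actual_bit) for each differing bit."""
    diffs = []
    for offset, (e, a) in enumerate(zip(expected, actual)):
        delta = e ^ a
        for bit in range(8):
            if delta & (1 << bit):
                diffs.append((offset, bit, (e >> bit) & 1, (a >> bit) & 1))
    return diffs


flash = NorFlash(8)
flash.program(0, 0x0F)   # clears the upper four bits of byte 0
flash.program(0, 0xFF)   # attempting to set them again has no effect
print(hex(flash.cells[0]))
print(flipped_bits(b"\x0f", b"\x1f"))  # one bit changed from zero to one
```

In this model there is no call that turns a single zero into a one without erasing the whole segment, which is the observation that ruled out an accidental external cause.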
Eventually the supplier informed us that the root cause was alpha particles generated within the housing of the chip. The silicon-oxide used in the housing, despite extensive cleansing, had a remaining contamination with naturally occurring uranium and thorium in the range of just a few ppb. This was sufficient to generate one alpha particle every thousand hours per square cm of chip surface area. If the alpha particle had sufficient energy to penetrate into the data storage layer and if it happened to impact a storage cell containing charge, the charge was dissipated, erasing that bit. If that bit happened to be in a section of code or data that impacted the boot process, the system would no longer start.
The solution was to switch over to an equivalent device whose housing was made from materials obtained from a different, cleaner source.
THE PLUGS WERE DIRTY…
In our new generation infotainment systems we were making use of hard disks. Personally, I found this quite courageous. When I think about the read/write head flying over the disk while driving along an unpaved road, I cannot help cringing.
It worked very well. My hat is off to the disk manufacturers. We actually had the most problems with the spindle lubrication which tended to harden at -20 °C, so that the disks couldn’t spin up when power was applied. They first needed to warm up…
Then one day we started having problems with data corruption. We were having odd and random failures as a result. We determined that data was being incorrectly written to the disks but also that data was being incorrectly read back. Then one day serendipity helped out. We were testing for possible cabling flaws and had the idea that the plug connection could be a problem.
After inspecting the plugs under a microscope, it became clear that the problem was caused by plug contamination. If a microscopic, partially conductive dust fiber happened to get into the plug connection, it could result in an intermittent short circuit between two connections. Thus, two bits in the parallel connector could be short-circuited, corrupting the data within the transportation path. The disk itself, with all of its internal ECC and checksum support, had no chance to perceive or correct the errors.
The fix was to ensure that the assembly was performed in a clean room environment. Problem solved!
WE LOST TWO BYTES…
The MOST bus is an optical bus technology, originally developed within Harman, which is optimized for multi-media applications within the automobile. In 2000 I was working on a project for a German OEM in which the MOST bus was causing some problems.
Next to the synchronous multi-media channels, the MOST bus also supports a command channel and an asynchronous channel for transfer of larger data packages. This asynchronous channel was being used to send tiles for the map display from the external navigation processor to the head unit. In order to accelerate communication (by reducing interrupt loading) an FPGA was used to accumulate data packages from the MOST controller, only generating an interrupt when its buffer was full. It took quite a bit of time to debug the FPGA code and required close cooperation between the software developers and the FPGA firmware developers. (BTW: This is also when I developed a profound hatred for conditional compilation using inline compiler flags – but that’s a different story.)
We finally got everything to work and it was – almost – perfect. If we tried to use the full FPGA buffer size for asynchronous communication, we would still lose an occasional map tile. Sad! We were forced to use only a small fraction of the FPGA buffer which slowed the communication enormously.
One of my colleagues spent some time on the problem and determined that the tiles were not truly being lost. What happened was that the lower two bytes of the checksum were lost (zeros appeared in their place) resulting in an application-level checksum failure, retries and finally exhaustion of the retry-count.
This observation from my colleague was the key to understanding the problem. The map tiles were being segmented on the sender side at several levels, with the FPGA buffer size being the top level and the size of the asynchronous channel in a single MOST frame being the lowest level. If the segmentation cascade resulted in the final MOST frame only containing two bytes, the FPGA read the wrong two bytes out of its buffer. The size of the data package was the deciding factor that resulted in corruption. We tested our hypothesis by changing the boundary value, which changes the size of the asynchronous channel in the MOST frames. As predicted, this changed the size(s) at which a data block would be corrupted. The actual fix in the FPGA was trivial, as is often the case.
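The dependency on package size can be sketched with a simplified two-level segmentation. This is an illustration under assumptions: the buffer size of 1024 bytes and the frame payloads of 48 and 36 bytes are invented values, and the real cascade had more levels.

```python
def final_frame_sizes(package_len: int, buffer_size: int, frame_payload: int):
    """Bytes carried by the last frame of each buffer-sized chunk of a package."""
    sizes = []
    for offset in range(0, package_len, buffer_size):
        chunk_len = min(buffer_size, package_len - offset)
        remainder = chunk_len % frame_payload
        # A remainder of zero means the last frame is completely filled.
        sizes.append(remainder if remainder else frame_payload)
    return sizes


def at_risk(package_len: int, buffer_size: int, frame_payload: int) -> bool:
    """True if the very last frame of the package carries exactly two bytes."""
    return final_frame_sizes(package_len, buffer_size, frame_payload)[-1] == 2


# Hypothetical sizes: changing the frame payload (the "boundary value")
# shifts the set of package lengths that end in a two-byte frame.
risky_48 = [n for n in range(1, 200) if at_risk(n, 1024, 48)]
risky_36 = [n for n in range(1, 200) if at_risk(n, 1024, 36)]
print(risky_48)
print(risky_36)
```

The two printed lists differ, which mirrors the test described above: moving the boundary changes the package sizes at which the corruption appears.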
THE BASIC PRINCIPLES OF ONE-CLICK BUILD
Have you ever heard of the “Joel Test”? Question two in the list asks: “Can you make a build in one step”?
Why should I care?
Every developer should be held responsible for what he delivers. It is irresponsible to deliver software for integration into a product if the developer cannot provide any guarantees that it functions. However, in order to test, the developer must be able to build. In some environments the procedure for building a testable binary can be complex, if not esoteric. The more manual steps involved in making a build, the more errors that are likely to be made. The ultimate goal for a reliable build process is obviously the one-click build.
I became painfully aware of this problem in 2001 when I was the integrator for an infotainment project for a large OEM. Our build process was far too complex, based on a long sequence of individual steps, all driven via a command-line interface. We had a deadline, needed to get some critical bug fixes integrated and needed a testable binary for evaluation. Towards the end of an extremely long working day that stretched far into the night, I attempted to successfully complete a final integration. After numerous failed attempts I finally gave up at 4:00AM and went home to sleep. To this day I don’t really know if it was a problem in the deliveries or an error that I made in performing the integration.
It was a hellish experience.
In the following weeks I assembled a Perl script which automated this process. Over the course of time this script was extended, made configurable, modified to be usable both interactively and when invoked from a script and finally (by my colleagues) recoded in Java. The build steps that were consolidated into a single action included:
1. Synchronization of the local workspace with a specified baseline from the code control system
2. Code generation
3. Compilation (for a chosen target architecture)
4. Link, both of the software product and of unit tests
5. Static code quality analysis (e.g. PC-lint or other) with report generation
6. Extraction of documentation (e.g. Doxygen or other)
7. Execution of Unit tests (when building for a target that allows automation of tests)
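The idea of consolidating these steps into a single action can be sketched as follows. This is not the original Perl or Java tooling; the step names simply mirror the list above and the placeholder actions stand in for the real tool invocations, which are not documented here.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class StepResult:
    name: str
    ok: bool


def run_build(steps: List[Tuple[str, Callable[[], bool]]]) -> List[StepResult]:
    """Run the build steps in order and stop at the first failure."""
    results = []
    for name, action in steps:
        try:
            ok = bool(action())
        except Exception:
            ok = False  # a crashing step counts as a failed step
        results.append(StepResult(name, ok))
        if not ok:
            break  # later steps make no sense on a broken build
    return results


# Placeholder actions; a real script would invoke the actual tools here.
STEPS = [
    ("sync workspace to baseline", lambda: True),
    ("generate code", lambda: True),
    ("compile", lambda: True),
    ("link product and unit tests", lambda: True),
    ("static analysis", lambda: True),
    ("extract documentation", lambda: True),
    ("run unit tests", lambda: True),
]

if __name__ == "__main__":
    for result in run_build(STEPS):
        print(("OK   " if result.ok else "FAIL ") + result.name)
```

Stopping at the first failed step is a design choice of this sketch: the report then shows exactly which step broke the build, so nobody has to guess where a long manual sequence went wrong.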
We chose not to automatically archive the build workspace or the results. Our choice was to always require a human to make the decision if archiving was appropriate.
After several years, this concept had diffused throughout the organization and into all customer projects. Every developer was able to use this tooling on his own workstation to make local builds for testing. Every integration was automated, starting with a collection of builds for different supported target architectures (Linux, QNX, Windows, X86, other SoC’s, etc).
This change in tooling was a key change that enabled several associated changes.
A common build system used across all projects and groups made results more easily comparable and reduced the learning curve for new hires.
It was possible to make unit testing and its results visible and reproducible, resulting in an effective program to improve both quality and coverage.
It was possible to start an effective program for controlled reduction of errors and warnings generated by the static code analysis.
It was more easily possible to evolve the development environment (choice of code control tooling, change of compiler, linker, change from use of “make” to an alternative build tool, etc).
The ultimate lesson learned is to seek simple key changes that make daily work easier.
THE DISADVANTAGES OF “MAKE”
Make is a powerful tool and I have a love-hate relationship with it. My first encounter with make was in 1983. My first impression was that the syntax of makefiles is funky and abstruse. I was glad that my colleague assumed responsibility for managing the makefile and was thankful for the simplicity that it offered in building our software.
Between 1989 and 1999 I also had a chance to acquaint myself with imake (which still has make underneath). I was almost excited. The syntax was much easier to understand and use, but that was accomplished by hiding everything in the imake configuration files. Those were nearly incomprehensible. But it also had the advantage that multiple different target architectures could be transparently supported.
In my years at Harman I initially tried to introduce imake. The motivation was our need to support numerous different combinations of operating system and underlying processor. Imake worked but was generally disliked because of the complexity of the configuration. However, we were increasingly confronted with another limitation resulting from the way we used make.
We were building large systems, spread over a large tree of directories. A typical solution that is widely used is to call make recursively to process all subdirectories before processing the current directory. The problem with this is threefold. First, the recursive calls to make require additional processing time to create and tear down the separate processes in which they are performed. Secondly, the order in which subdirectories are processed is statically defined in the local (i)makefile. Finally, knowledge of dependencies (from which the correct processing order can be derived) is only locally available. This makes partial builds unreliable. The only way to be certain that everything is correct is to do a clean build. For a large system this is an unacceptable waste of valuable developer time.
After discussion, we ended up switching to JAM. This seemed to combine an overall simpler syntax with the advantages of imake. Most importantly, it resolved the problems that we had with make. The final deciding factors were that partial builds became fully reliable and that they typically completed faster – primarily due to the elimination of the recursive calls to make. A savings of 30 seconds per build adds up very quickly across large development organizations.
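Why global dependency knowledge makes partial builds reliable can be shown with a small sketch. It does not describe how JAM works internally; the target names are invented and the ordering relies on Python's standard `graphlib` module (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# Hypothetical system: each target maps to the targets it depends on.
DEPS = {
    "app": {"libui", "libnav"},
    "libui": {"libbase"},
    "libnav": {"libbase"},
    "libbase": set(),
}


def build_order(deps):
    """Derive a valid build order from the complete dependency graph."""
    return list(TopologicalSorter(deps).static_order())


def affected_by(changed, deps):
    """Every target that must be rebuilt when `changed` is modified."""
    dirty = {changed}
    grew = True
    while grew:
        grew = False
        for target, needs in deps.items():
            if target not in dirty and needs & dirty:
                dirty.add(target)
                grew = True
    return dirty


order = build_order(DEPS)
print(order)
print(sorted(affected_by("libnav", DEPS)))
```

With the whole graph in one place, the build order is derived rather than hard-coded per directory, and a change in `libnav` is known to require rebuilding only `libnav` and `app`. A recursive make that sees one directory at a time cannot draw that conclusion safely.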
I like make – but it has its limitations.
A RANT AGAINST 100% SOLUTIONS
I live in Germany. I like Germany. But some things drive me nuts! One aspect is that German engineers seem to be reluctant to accept 80% solutions; they always seem to push for 100% solutions.
During the evolution of our platform concept within Harman, we needed to harmonize (harmanize?) the way in which our source code was structured. We had an initial structure which was working adequately, but several drawbacks became visible which sometimes resulted in confusion and avoidable errors. Furthermore, the developers complained loudly that they spent too much time clicking through the structure to find the directory in which they needed to work.
I had an idea for a change which was both minimal and would largely eliminate the observed problems. It was a typical 80% solution.
My colleagues were not convinced. “Just imagine if this change is introduced and shortly after its introduction we develop a better solution. Then we will need to change everything all over again. And we know that this minimal solution is not perfect because it doesn’t include extensions for all future architectures that we may need to support.”
So my colleagues eagerly started to work on the development of the ultimate concept for code archive structure. After roughly 18 months they finally had a proposal. It required a dramatic restructuring of the existing code archive but offered all the bells and whistles needed to be completely future proof.
After it was introduced, the developers spent even more time clicking through significantly more deeply nested directory structures. IMHO the code structure became more difficult to understand and manage due to the choice to strictly separate interfaces from implementations in separate directory trees.
An excellent example of an ultimate solution which failed to address the original problems.
A CONCEPT FOR MODULAR SOFTWARE DEVELOPMENT AND ARCHITECTURE – MOCCAV2
When I joined Harman in 1999 they were making four major changes in their software development all at once. They were moving from 16-bit to 32-bit processors, moving from main-loop programming to the use of an RTOS and moving from the use of assembler with a bit of C to C/C++. Parallel to these efforts, I was asked to participate in the development of an application framework (MMI2000).
We were successful, the framework was of great assistance and was sufficient foundation for two generations of infotainment systems. At that point its weaknesses started to become apparent and the search for a successor was on.
In 2002 I was tasked with specifying a next generation application framework. After several months of work with strong support from the lead architects throughout Germany, we were able to put a proposal together which was then approved by Dr. Geiger (our CEO). In the following years I was privileged to lead the team tasked with developing this framework.
Our framework was based on five underlying abstractions and principles:
1. The framework must provide an OS abstraction layer (OSAL) so that it will function equally well under Windows, QNX and Linux.
2. Interfaces between components are treated as independent architectural entities. They follow well-defined structural and behavior patterns and are described with a description language from which executable code is generated. The code generation follows a pattern similar to CORBA with the server and client side being separately implemented.
3. Components can be described as containers. They contain a list of exported (supported) interfaces, a list of imported (required) interfaces and an algorithm which implements the functionality of the component. Components are described with a description language from which executable code is generated.
4. Logical deployment is the process of providing connections between components. A connection will bind an imported interface of one component (the client) with an exported interface of another component (the server). Again, this is described with a description language.
5. Physical deployment is the work of defining the processes which will be linked, the threads that will run in those processes and the assignment of components to threads. This too was described with a description language from which executable code and linker files are generated.
The result of these abstractions was threefold. First, several important classes of problems could be eliminated from the final system because it was possible to do formal checking using the descriptions that were provided. One example was that there were no more interface version conflicts, because these were detected during logical deployment. The second important result was that changes in logical and physical deployment could be made in minutes without having to change any source code. The supporting tooling made such changes extremely fast and easy. Then, after a rebuild, the changes were baked into the system. Finally, the formal description of components and their interfaces ensured that they were easily re-usable in new projects.
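The kind of formal check described above can be illustrated with a small sketch. This is not MoCCAv2 code or its description language; it is a minimal Python model, and all names in it (Interface, Component, Connection, check_logical_deployment, ITuner) are hypothetical. It assumes that an interface is identified by a name and a single version number, and that a conflict means the client and server versions differ.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Interface:
    name: str
    version: int


@dataclass
class Component:
    name: str
    exports: list = field(default_factory=list)  # interfaces offered (server side)
    imports: list = field(default_factory=list)  # interfaces required (client side)


@dataclass(frozen=True)
class Connection:
    client: str     # name of the importing component
    server: str     # name of the exporting component
    interface: str  # name of the interface being bound


def find(interfaces, name):
    """Return the interface with the given name, or None."""
    return next((i for i in interfaces if i.name == name), None)


def check_logical_deployment(components, connections):
    """Return a list of error strings; an empty list means the deployment is consistent."""
    by_name = {c.name: c for c in components}
    errors = []
    bound = set()
    for conn in connections:
        required = find(by_name[conn.client].imports, conn.interface)
        offered = find(by_name[conn.server].exports, conn.interface)
        if required is None:
            errors.append(f"{conn.client} does not import {conn.interface}")
            continue
        bound.add((conn.client, conn.interface))
        if offered is None:
            errors.append(f"{conn.server} does not export {conn.interface}")
        elif offered.version != required.version:
            errors.append(
                f"{conn.interface}: {conn.client} requires v{required.version}, "
                f"{conn.server} provides v{offered.version}"
            )
    # Every imported interface must be connected to some server.
    for comp in components:
        for imp in comp.imports:
            if (comp.name, imp.name) not in bound:
                errors.append(f"{comp.name}: import {imp.name} is not connected")
    return errors
```

Because the check works only on the descriptions, a mismatch shows up before any component code is built or run, which is the point of treating deployment as a separate, checkable artifact.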
This is a short and simplified description of what MoCCAv2 encompassed. However, it went on to become the foundation upon which the Harman platform concept was built. This platform was a key internal supplier for all Harman infotainment projects up through 2012 when I left Harman.
It still saddens me deeply that I was forbidden to pursue the release of this framework as an open source project under the umbrella of the Eclipse Foundation. Our discussions had progressed quite far before I was directed to break off communication. The framework had the potential to become an effective industry standard, had the management within Harman not decided that it was more important to protect their IP.
May, 1989 – Jun, 1999
When I left Philips Medical Systems in 1989, it was to move to Germany for personal reasons. I was fortunate in being able to find a new employer that was strong in MRI, but far enough away from mainstream clinical applications to avoid problems with the non-competition considerations.
WHAT IS A PARAMETER HANDLER AND WHY SHOULD I CARE?
Simplicity can take on many forms. One important goal should always be to offer simplicity to the users of a system or product. This can become difficult when the users have vastly different interests. Magnetic resonance imaging (MRI) offers one example.
Radiologists often have a limited understanding of how MRI works. They want a description of an MRI measurement in terms of a list of standard images, having known contrast properties, oriented and positioned to show the suspected problem area within the body. The key observation is that their reference is the patient’s body and they are focused on the images as an end product.
The physicists that develop the measurement methods used in MRI use the magnet as their reference. They describe the images using the B0 magnetic field, the x-, y-, and z-gradient fields, the RF pulses and the ADC used for data sampling – all orchestrated with a timing precision that still fills me with awe. The key observation is that their reference is the magnet and their focus is on the physical process that needs to be followed to obtain the data from which the images can be constructed.
The engineers that develop an MRI system have yet another point of view. They must use the physicist’s description (how to make the measurement), combine this correctly with the radiologist’s description (which images are to be produced) and turn this into a precisely correct stream of commands to be sent to the controls for the gradient magnets, the RF transmitter, the RF receiver, the image reconstruction and other parts of the system. Their reference is the control of the electronics with the required timing constraints and their focus is on acquiring the required data and creating the images.
All three points of view must come together precisely so that the system can achieve its purpose.
The parameter handler that I developed within Bruker had the purpose of bringing these different points of view together. It allowed a measurement method to be described in a fashion that made all three of these viewpoints accessible and easily usable. The physicists were able to describe new measurement methods as they were developed. It was simple to add the coordinate transformations that enabled the description that the radiologists needed. And it was easy to add that final transformation that enabled the control of the spectrometer. The result was an encapsulation of essential complexity, offering each user group simplified access, allowing them to focus on their needs.
A VARIATION ON UNIX PIPES FOR MRI
Another path to simplicity can be seen in the motto “divide and conquer”. Many songs of praise have been sung for the concept of Unix pipes and how powerful they can be in making complex tasks appear simple. At Bruker I was once inspired by Unix pipes to produce something similar.
Our problem was one of growing complexity. More and more measurement methods were being developed. The size and number of images that the users were collecting seemed to be growing without an upper bound. As the measurement methods changed, the image reconstruction had to be adapted to provide new capabilities. The simple expectation of the end users was that the images would be immediately visible, with no perceivable delays. We needed to change our monolithic structure to create a modular structure. At the same time, we needed to improve performance and keep resource consumption (memory and CPU time) under tight control.
The key observation was that the actions required in an MRI measurement can be described as a pipeline and that these actions can run in parallel. The basic steps include (1) raw data acquisition, (2) optional storage of raw data, (3) first-pass Fourier transformation of individual profiles, (4) sorting of the profiles, (5) second-pass (and higher) Fourier transformation of the data, (6) storage of the final images and (7) triggering of the image display package.
Our solution was to package each of these steps in a separate program and then to chain them together as individual filters in a pipeline. The raw data went in at the start and the completed images came out at the end. Data was passed from one program to the next using shared memory for communication and semaphores for synchronization. Since all programs had access to the measurement description through the parameter handler, no communication about the data structure was required. Our memory consumption could be controlled by limiting the size of the buffers used to pass data through the pipeline. Our CPU consumption could be controlled with minor adjustments of process priorities. The concept was made simple for the developers by providing a basic implementation of an “empty” filter to hide all the communication behind just a few function calls.
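The structure can be sketched in a few lines of Python. This is only an illustration of the pattern, not the original implementation: threads and bounded queue.Queue objects stand in for the separate programs, shared memory and semaphores, and the names filter_stage and run_pipeline are hypothetical. The bounded queues play the role of the size-limited buffers: a fast stage blocks when its output buffer is full, which caps memory use.

```python
import queue
import threading

_END = object()  # marker that tells each filter the data stream is finished


def filter_stage(work, inbox, outbox):
    """The 'empty' filter: all communication lives here, 'work' is the only custom part."""
    while True:
        item = inbox.get()
        if item is _END:
            outbox.put(_END)  # pass the end marker downstream, then stop
            return
        outbox.put(work(item))


def run_pipeline(items, stages, buffer_size=4):
    """Push items through the stages in order; each stage runs in its own thread."""
    queues = [queue.Queue(maxsize=buffer_size) for _ in range(len(stages) + 1)]
    threads = [
        threading.Thread(target=filter_stage, args=(work, queues[i], queues[i + 1]))
        for i, work in enumerate(stages)
    ]

    def feed():
        for item in items:
            queues[0].put(item)  # blocks while the first buffer is full
        queues[0].put(_END)

    threads.append(threading.Thread(target=feed))
    for t in threads:
        t.start()

    results = []
    while True:
        item = queues[-1].get()
        if item is _END:
            break
        results.append(item)
    for t in threads:
        t.join()
    return results
```

A new processing step is then just one more function in the stages list, for example run_pipeline(profiles, [first_pass, sort_profiles, second_pass]) with hypothetical stage functions.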
After debugging several flaws in the Silicon Graphics Unix semaphore implementation, the concept worked perfectly. Life was good.
Sometimes something that looks extremely complex can actually turn out to be extremely simple once it is understood.
One difficulty faced by the physicists that develop measurement programs for MRI is the issue of coordinate transformations. They must be able to convert the radiologist’s description of the desired images, using the patient’s body as the reference, to their description which uses the magnet as the reference. The calculations for these coordinate transformations can sometimes be a bit hairy.
One of the physicists once approached me asking for support. The calculations in his new measurement method were so complex and time-consuming that the measurement failed because the computer couldn’t keep up. This problem had appeared after he added the coordinate transformations needed to make his measurement method usable for radiologists.
When I looked at his measurement program one thing that popped out at me right away was a magical calculation which was so complex that it spread across several lines of the page. This calculation was not done only once, it was duplicated at a large number of locations throughout his code. This was the key transformation between the two coordinate systems. Clearly, if this expression could be simplified, it would help.
I spent most of the day analyzing this expression. I didn’t believe my results, so I threw everything out and did it again. Then I checked my work – several times.
The magical calculation could be reduced to the constant value “1”.
After substituting every instance of this expression with the value “1”, performance was no longer a problem.
THE WONDERS OF CODE VERSION CONTROL AND DAILY BUILDS
Q: What is a simple mechanism which is both easy to implement and which will ensure a minimal level of quality?
A: Daily builds
Q: How can we ensure that our development is truly following a path of controlled evolution towards a desired goal?
A: Have a clear version control strategy based on a good tool
When I first joined Bruker, they had a rudimentary version control tool, but no clear strategy. Each developer worked independently in his own local work space. Updates to the archive were done when the developer considered it useful. Builds of the full system from the archive were done when a customer delivery was pending. They rarely succeeded on the first attempt with all the negative consequences that entails.
Eventually the company recognized that the situation was intolerable. The first step was to switch over to a code control tool which was easier to use. In that context I also introduced the idea of a daily build. Using the available scripting tools and cron jobs I ensured that a clean build of the current development head revision would be done every night. After reviewing the results the next morning, I sent out mails to everyone that had checked in code that didn’t compile and build. It took several months for this to be accepted by everyone and it took an additional several months for a failure of the daily build to become the exception.
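A nightly build of this kind needs very little machinery. The following is a minimal sketch in Python rather than the shell scripts of the time. The function names, the report format and the stand-in build command are all assumptions; a real job would first check out the head revision and would obtain the list of committers from the version control tool.

```python
import subprocess
import sys

# Hypothetical crontab entry to run this script every night at 02:00:
#   0 2 * * * /usr/bin/python3 /path/to/nightly_build.py

# Stand-in for the real build command, so that the sketch runs anywhere.
SELF_TEST_COMMAND = [sys.executable, "-c", "print('build ok')"]


def run_build(command, workdir=None):
    """Run one build command; return (succeeded, combined stdout and stderr)."""
    result = subprocess.run(
        command,
        cwd=workdir,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    return result.returncode == 0, result.stdout


def failure_report(committers, log, tail_lines=20):
    """Build the text of the morning mail: who checked in, and the end of the build log."""
    tail = "\n".join(log.splitlines()[-tail_lines:])
    names = ", ".join(sorted(committers))
    return f"Nightly build FAILED\nCheck-ins since the last good build: {names}\n\n{tail}"
```

The important design choice is that the build starts from a clean checkout of the archive, not from any developer's workspace, so that a success actually says something about what is in the archive.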
One important element in acceptance was that when I had made the mistake, I still sent out the mail announcing the failure, its cause and that I was responsible.
Once our head revision was clean, we then introduced a branching and labelling strategy. Suddenly customer distributions became reproducible. It was possible to differentiate between the development line and branches that had been created for qualification testing or for other purposes. The entire software organization became more professional in many ways.
Simple, evolutionary changes can have dramatic consequences.
ABSTRACT COMMAND HANDLING
Sometimes a simple paradigm shift can change things profoundly.
We had just finished converting our entire user interface to X11/Motif, which also forced us to shift to an asynchronous event-triggered software architecture. After many years of synchronous, main-loop programming, I felt as though my world had been turned inside-out. Then our group leader asked me to think about a command handling concept. After all, he said, the events being received from the GUI can be thought of as user commands being sent to the underlying software.
After some thought and experimentation, I was able to develop a usable concept. The underlying software could dynamically register commands in a central registry at startup. A command parser was then defined to parse command strings to identify the command and its arguments. Finally, a command dispatcher was created which would locate the command in the registry and make the associated function call to perform the command processing. In later stages, a mechanism for inserting commands from the outside was created and a mechanism for recording and replaying command sequences was added.
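The three pieces (registry, parser, dispatcher) plus record and replay fit into a short sketch. This is a Python illustration of the pattern, not the original code; the class name CommandHandler, its methods and the example command are hypothetical, and shlex.split is used as a convenient stand-in for the command parser.

```python
import shlex


class CommandHandler:
    def __init__(self):
        self._registry = {}  # command name -> function
        self.history = []    # every dispatched command line, for later replay

    def register(self, name, func):
        """Called by the underlying software at startup to announce a command."""
        self._registry[name] = func

    def parse(self, line):
        """Split a command string into the command name and its arguments."""
        tokens = shlex.split(line)
        if not tokens:
            raise ValueError("empty command")
        return tokens[0], tokens[1:]

    def dispatch(self, line):
        """Look the command up in the registry and call it with its arguments."""
        name, args = self.parse(line)
        if name not in self._registry:
            raise KeyError(f"unknown command: {name}")
        self.history.append(line)
        return self._registry[name](*args)

    def replay(self, lines):
        """Re-run a recorded sequence of command lines."""
        return [self.dispatch(line) for line in list(lines)]
```

With this in place the GUI callbacks and an external script use the same entry point, for example handler.dispatch("set_slices 12") for a hypothetical set_slices command.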
The final result was that anyone capable of using a scripting language of any kind was able to write scripts to drive the MRI system and make measurements. This same mechanism offered an easy means to implement entire measurement protocols which would produce a pre-defined sequence of images of different types.
It had consequences for our development process as well. It was much easier to add new functionality and make it accessible through the command handler for testing purposes. Then, after full testing, it could be fully integrated to make it available through the GUI as well.
Another by-product was that our system qualification and acceptance testing became much more professional. The tests could be automated and were easily repeatable.
And it all started with a new X11/Motif-based GUI.
Jan, 1985 – Mar, 1989
FSM OR WHY I STILL MISS VAX VMS
Finite state machines (FSM) look simple and are extremely useful. Unfortunately, as the German saying describes, the devil is hiding in the details.
Almost any type of control logic can be described using an FSM. There is a widely accepted set of conventions to describe an FSM in a diagram, which makes it easy to communicate your solution to other people, including non-engineers. Unfortunately, there are relatively few developers that are truly able to convert such a diagram into functioning code that does exactly what the diagram describes. Furthermore, in real life, there are always unanticipated circumstances that lead to unforeseen situations. In an FSM this results in misbehaviour or outright failure.
Personally, I was always lazy. It was my opinion that after finishing and reviewing the original diagram, something else (preferably the computer) should take responsibility for writing the code. In the intervening years this has become a reality. If I may be permitted a short advertising blurb, the best such tool of which I am aware is Dezyne from the company Verum.
Back in those days such tools were still only the subject of fevered dreams. But what the VMS operating system on the VAX workstations offered was an assembly-level FSM framework. It was possible to describe input events, states and transition actions using a collection of macros. This description was then used with the framework to provide a complete implementation. It was wonderful! Once I had the description of the FSM correct, it would simply work. In some ways this was the start of my passion for model-driven development.
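The same separation between a declarative description and a generic engine can be shown in a few lines of Python. This is not the VMS macro framework, only a sketch of the idea; the StateMachine class and the example states and events are hypothetical. The table is the description, and the engine never changes.

```python
class StateMachine:
    def __init__(self, initial, transitions):
        # transitions maps (state, event) -> (next_state, action or None)
        self.state = initial
        self._transitions = transitions

    def fire(self, event):
        """Process one input event; unanticipated events fail loudly instead of misbehaving."""
        key = (self.state, event)
        if key not in self._transitions:
            raise ValueError(f"event {event!r} is not handled in state {self.state!r}")
        next_state, action = self._transitions[key]
        if action is not None:
            action()
        self.state = next_state
        return self.state


def build_scan_control(log):
    """Hypothetical example: a scan that can be started, finished or aborted."""
    table = {
        ("idle", "start"): ("scanning", lambda: log.append("begin acquisition")),
        ("scanning", "done"): ("idle", lambda: log.append("store images")),
        ("scanning", "abort"): ("idle", lambda: log.append("discard data")),
    }
    return StateMachine("idle", table)
```

Reviewing the table against the diagram is a line-by-line comparison, which is much easier than reviewing hand-written control flow.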
MIEREN NEUKEN (WITH APOLOGIES TO ALL DUTCH-SPEAKING READERS)
I had a very simple assignment. Program a DSP that controlled the gradient magnets.
The input was a description of the gradient waveform broken down into linear segments. Some segments would have a constant value, others would be the rising or falling flank of a trapezoid. The DSP had to be able to generate outputs to the gradient magnets at a rate of 125 kHz (one data point every 8 microseconds). Furthermore, a compensation for discretization errors had to be included. The goal was to ensure that the area under the gradient curve was as close as possible to the value required.
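The task can be sketched as follows in Python. This is not the DSP code, and the compensation scheme shown (carrying each rounding error into the next sample) is an assumption chosen for illustration; the text above does not say which scheme was actually used. The segment format (start value, end value, number of samples) and the function names are likewise hypothetical. With error carrying, the sum of the integer outputs never drifts more than half an output step from the sum of the ideal values, so the area under the curve stays close to the requested one.

```python
SAMPLE_PERIOD_US = 8  # 125 kHz output rate: 1 / 125 000 s = 8 microseconds


def ideal_samples(segments):
    """Expand linear segments (start, end, count) into one ideal value per sample."""
    for start, end, count in segments:
        step = (end - start) / count
        for i in range(count):
            yield start + step * i


def quantize_with_error_feedback(values):
    """Round each value to an integer output, carrying the rounding error forward."""
    carry = 0.0
    outputs = []
    for value in values:
        target = value + carry
        out = round(target)
        carry = target - out  # what this sample got wrong is owed to the next one
        outputs.append(out)
    return outputs


def area(samples):
    """Area under the sampled curve, in output units times microseconds."""
    return sum(samples) * SAMPLE_PERIOD_US
```

A trapezoid would be described as a rising flank, a plateau and a falling flank, for example [(0.0, 10.3, 7), (10.3, 10.3, 20), (10.3, 0.0, 7)].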
Programming the DSP was fun and interesting, although somewhat tedious. As I became more and more familiar with its capabilities, I kept finding new small optimizations to make my code better and/or faster. Then I had to sit down and do an analysis of error accumulation during a measurement, showing the sources of errors and what the accumulated error would be at the end of a measurement. This was then reviewed by one of the physicists to ensure that it would be within their tolerances.
The final step was to ensure that my code would be able to produce new data points quickly enough to keep up with the measurement. I had to be able to show that my code would always be able to generate a new data point within 8 microseconds.
So, I sat down and started counting instructions along every possible path of control through my software. Then I added up the execution times of all the instructions and was able to show that I remained below the targeted 8 microseconds. When I was finished, I went to my team leader to review my work. When he saw my results, he congratulated me on the quality of my work and taught me a new Dutch phrase to describe this type of effort, “mieren neuken”.
A PERSONAL PRE-HISTORY OF SHARED LIBRARIES AND LINKER CONSTRUCTION
In 1986 we were doing development work on the next generation of MRI device within Philips, the T5 MRI scanner. This had a VAX workstation as its host and a separate slave computer that ran the data acquisition. My task started with developing a concept for the software architecture on the slave processor. This was my first encounter with an RTOS for embedded systems (pSOS), which meant that this was a very educational experience.
The slave computer had an unfortunately limited amount of memory and we had an unfortunately large number of tasks that needed to be done by the slave. We needed to find some mechanism to reduce our memory footprint, so as to have some reserves for future expansion. It is important to note that this was before the time of shared libraries.
At some point I had several realizations. A significant amount of the code on the slave processor was common to all of the tasks running there. Furthermore, most of this code was re-entrant (or could be easily made re-entrant), which made it possible for all tasks to share a single copy of this code. Finally, we had ownership of the download and startup procedure on the slave processor.
So, we created a tool which would first download our hand-made shared library, extracting the symbol table along the way. This tool would then perform pre-processing on the images of the other tasks during the download, to resolve all references to the shared library.
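The core of such a tool is small. The following Python sketch is not the original download tool; it models a task image as a list of words and a relocation as a pair (position in the image, symbol name), which is an assumed, simplified format. The function names and the example symbols are hypothetical.

```python
def load_shared_library(symbol_offsets, base_address):
    """Place the library at base_address and return its table of absolute addresses."""
    return {name: base_address + offset for name, offset in symbol_offsets.items()}


def resolve_task_image(image, relocations, symbol_table):
    """Return a copy of the task image with every library reference patched in.

    image        -- list of words as produced by the normal build
    relocations  -- list of (index into image, symbol name) still to be resolved
    symbol_table -- absolute addresses from load_shared_library
    """
    missing = sorted({name for _, name in relocations if name not in symbol_table})
    if missing:
        raise LookupError("unresolved symbols: " + ", ".join(missing))
    patched = list(image)
    for index, name in relocations:
        patched[index] = symbol_table[name]
    return patched
```

Refusing to download an image with unresolved symbols moves a whole class of startup crashes on the slave processor to a clear error message on the host.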
It was actually surprisingly easy: I just had to marry a hand-written “linker” with the code that directed the download and startup sequence for the slave processor. Once it was running, everything worked like a dream.
Shared libraries are simple, once you understand how they work.
Mar, 1983 – Dec, 1984
IT’S THE HARDWARE AND I CAN PROVE IT
Sometimes a problem requires doing something which isn’t in your job description.
We were developing an ultrasound scanner for clinical use. I had written a number of the drivers (for the display, the keyboard, the scan converter, etc.) when one day my test system started locking up immediately after starting. The startup sequence seemed to be working perfectly, but as soon as I touched one of the keys on the keyboard, the system became unresponsive. Nothing more happened. The problem was completely reproducible: all we had to do was start the system and press any key on the keyboard.
The resulting dialog between the responsible hardware engineer and me was a classic. His statement was simple: the electronics had not been changed in any fashion, whereas the software changes continuously. Something must have changed in the software to cause this behavior. He would not even look at the electronics until I could show, using a previously tested software version, that the software wasn’t at fault.
In retrospect I now have more understanding for his position than I had then. Honestly, we had no real concept of code version control or of stringent qualification testing for software. We were all self-taught software developers, more focused on our algorithms than on the system as a whole.
My solution was to come in early several days later. I grabbed a logic analyzer and its handbook, hooked it up to the system and started tracing signals. The result was that I could show that the keyboard was triggering an interrupt request. I could show the signal being sent out to clear the interrupt at the end of the ISR. However, I could also show that the signal to clear the interrupt never arrived at the (external) interrupt controller. I then waited for the electronics engineers to arrive, holding a glass of champagne in my mind. When they arrived I showed them my results. The responsible engineer did some analysis of his own later in the day and determined that a buffer chip along the signal path had burnt out.
Sometimes it’s not the software – but that always has to be proven!