iainjmitchell

Terraform CDK or AWS CDK

Wed, 08 Sep 2021 09:00:00 +0000

Introduction

This post is the write-up of a comparison between Terraform CDK and AWS CDK. Both of these Cloud Development Kits (CDK) are new approaches to writing infrastructure-as-code. They both provide libraries of infrastructure constructs in a variety of programming languages. Also included are command-line tools to deploy the infrastructure.

Like traditional Terraform, the Terraform CDK supports multiple cloud providers. It also supports providers of other types of infrastructure (e.g. CDNs), whereas AWS CDK is dedicated to managing AWS infrastructure.

This assessment focusses on managing AWS infrastructure ONLY.

What was being assessed?

Ease of use
- How easy is it to use the CDK? Are there any pain points, or features that make the engineer's life easier?
Ease of integration into build and deployment pipeline
- The CDKs are executed on a build server as part of a deployment pipeline. How easy is it to set this up?
Sharing of common components
- Creating common reusable infrastructure components is an inherent feature of infrastructure-as-code. The mechanism of sharing these can influence how much reuse actually occurs.
Testing
- How hard is it to test drive the creation of the infrastructure-as-code?
Code cleanliness
- Do the patterns and design of the CDK encourage cleaner code?

Test application

Both of the CDKs were assessed against creating the same simple application in AWS. This was an AWS Lambda that read a message from an SQS queue and, using the data within it, sent a notification to a Slack channel. It also required a second SQS queue as a Dead Letter Queue (DLQ) for repeated failed invocations. Additionally, the Lambda needed to read the Slack API credentials from AWS Parameter Store.

A summary of the architecture of the test application is in the diagram below.

TypeScript was chosen as a language for both the Lambda and CDK infrastructure code. Jenkins, self-hosted in AWS, was used as a build and deployment server.

Terraform CDK

1. Ease of use

The initial set-up documentation made it easy to get started with the CDK. Unfortunately, the documentation deteriorated after this point. Context-based help within the IDE pointed to the Terraform HCL documentation rather than anything based on the CDK. Furthermore, for some queries it was not possible to find answers in the community. This resulted in diving into the source code or applying guesswork.

The CDK still relies upon Terraform state files, which need to be saved to S3 as part of your stack updates. As a disclaimer, I'm not a fan of this system of tracking infrastructure state away from the source of truth, and a couple of times the state file became out of sync with the infrastructure in AWS. However, this is no different from using Terraform HCL and you may be happy to accept this risk.

2. Ease of integration into build and deployment pipeline

Running the CDK in Jenkins requires a few dependencies. As such, it was necessary to create a Docker container image (see below).

FROM node:14-alpine

RUN apk --no-cache add curl
RUN apk add --no-cache \
        python3 \
        py3-pip \
    && pip3 install --upgrade pip \
    && pip3 install \
        awscli \
    && rm -rf /var/cache/apk/*
RUN apk add terraform --repository=http://dl-cdn.alpinelinux.org/alpine/edge/community
RUN npm install -g cdktf-cli

A few of these dependencies are organisation specific, but the important ones to note are:

AWS CLI
Terraform
TypeScript Terraform CDK CLI (NPM package)

Additionally, Jenkins required an AWS IAM policy for access to the infrastructure that it needed to update. This was then available to ALL jobs within Jenkins, although it could have been more finely controlled by utilising IAM switch role.

Overall, it took a day to get Terraform CDK running on Jenkins, and progress was hindered at times by the lack of documentation.

3. Sharing of common components

Terraform CDK has a concept of scope that's passed into each of the constructors. Inside a construct, the current object (this) is then passed on as the scope for any child constructs. The LambdaFunction example below demonstrates this.

export class SqsSlackStack extends TerraformStack {
    constructor(scope: Construct, stackName: string, configuration: SnsSlackStackConfig) {
        super(scope, stackName);

        const lambda = new LambdaFunction(this, `${stackName}-lambda`, {..});
    }
}

Modularisation just requires that the scope is passed into the programming construct (e.g. Class, Factory Method). This makes creating independent modules straightforward. In this case, as TypeScript is being used, these could be shared through a public or private NPM registry.
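
As a rough illustration of that point, the sketch below models the scope pattern with a small stand-in class rather than the real cdktf library, so it runs without any dependencies. MiniConstruct, createQueuePair, QueuePair and the ids are all hypothetical names; the only idea taken from the CDK is that a module receives a scope and attaches its children to it.

```typescript
// Stand-in for a CDK construct: it only records its id and its children.
// This is NOT the cdktf API, just a dependency-free model of the scope idea.
class MiniConstruct {
    readonly children: MiniConstruct[] = [];

    constructor(scope: MiniConstruct | undefined, readonly id: string) {
        // Registering with the scope is what places a construct in the tree.
        if (scope) {
            scope.children.push(this);
        }
    }

    // Paths of this construct and all of its descendants, parent first.
    paths(prefix = ''): string[] {
        const path = prefix ? `${prefix}/${this.id}` : this.id;
        const result = [path];
        for (const child of this.children) {
            result.push(...child.paths(path));
        }
        return result;
    }
}

// A reusable "module" written as a factory function (hypothetical).
// It knows nothing about the stack it is used in; it only needs a scope.
function createQueuePair(scope: MiniConstruct, baseName: string): MiniConstruct[] {
    const deadLetter = new MiniConstruct(scope, `${baseName}-dlq`);
    const inbound = new MiniConstruct(scope, `${baseName}-inbound`);
    return [inbound, deadLetter];
}

// The same module as a class: it passes `this` on as the scope for its children.
class QueuePair extends MiniConstruct {
    constructor(scope: MiniConstruct, id: string) {
        super(scope, id);
        createQueuePair(this, id);
    }
}

const demoApp = new MiniConstruct(undefined, 'app');
const demoStack = new MiniConstruct(demoApp, 'sqs-slack');
new QueuePair(demoStack, 'orders');
const treePaths = demoApp.paths();
```

Because neither the factory nor the class refers to the stack directly, either form could be moved into its own package without changes.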

One challenge around modularisation is that the AWS provider NPM module is a single mono-module. As such, a module for SqsQueues still has to include everything AWS in its node_modules.

4. Testing

At the point of implementing the application there was no unit testing support. It was possible to test by calling cdktf synth in the shell and asserting on the resulting output. This was complex and very slow to run.
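
To give a feel for what asserting on the synthesized output involves, here is a minimal sketch. It assumes the output follows Terraform's JSON syntax (resources nested by type, then by name); resourcesOfType and the sample document are hypothetical and hand-written, not taken from the project.

```typescript
// The parts of Terraform's JSON syntax that this sketch reads.
type ResourceAttributes = { [attribute: string]: unknown };
type TerraformResources = { [resourceType: string]: { [resourceName: string]: ResourceAttributes } };
type TerraformJson = { resource?: TerraformResources };

// Returns every resource of the given type from synthesized Terraform JSON.
function resourcesOfType(synthesizedJson: string, resourceType: string): ResourceAttributes[] {
    const parsed = JSON.parse(synthesizedJson) as TerraformJson;
    const allResources: TerraformResources = parsed.resource || {};
    const byName: { [resourceName: string]: ResourceAttributes } = allResources[resourceType] || {};
    return Object.keys(byName).map(resourceName => byName[resourceName]);
}

// Hand-written sample standing in for the text produced by `cdktf synth`.
const sampleSynthOutput = JSON.stringify({
    resource: {
        aws_sqs_queue: {
            inbound: { name: 'demo-inbound', visibility_timeout_seconds: 60 },
            dlq: { name: 'demo-dlq' }
        }
    }
});

const synthesizedQueues = resourcesOfType(sampleSynthOutput, 'aws_sqs_queue');
const inboundQueues = synthesizedQueues.filter(queue => queue['name'] === 'demo-inbound');
```

In the real set-up the JSON came from running the CLI in a shell, which is where most of the slowness was.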

Thankfully, a recent update to Terraform CDK has introduced better support for unit testing. Although at the moment it appears limited to testing through Jest. As this is a new development, it is not possible to assess how usable this addition is.

5. Code cleanliness

The majority of properties on the CDK classes are strongly typed. A notable exception is that of policy documents. These require a string of the policy in JSON format. This leads to storing the policy in a file or performing an inline JSON.stringify() call (see below). Neither lends itself to readable code.

const inboundQueue = new SqsQueue(stack, `${stackName}-inbound`, {
    name: `${stackName}-inbound`,
    redrivePolicy: JSON.stringify({
        deadLetterTargetArn: deadLetterQueue.arn,
        maxReceiveCount: 5
    }),
    visibilityTimeoutSeconds: 60,
    tags
});
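
One way to soften this, sketched below, is to wrap the stringify call in a small typed helper so that the stack code deals with a checked object. RedrivePolicy and toRedrivePolicyJson are hypothetical names and the ARN is a placeholder; nothing here is provided by the CDK itself.

```typescript
// Field names mirror the redrive policy used in the example above.
interface RedrivePolicy {
    deadLetterTargetArn: string;
    maxReceiveCount: number;
}

// Hypothetical helper: validates the policy, then serialises it to the
// JSON string that the queue property expects.
function toRedrivePolicyJson(policy: RedrivePolicy): string {
    const count = policy.maxReceiveCount;
    if (Math.floor(count) !== count || count < 1) {
        throw new Error('maxReceiveCount must be a positive whole number');
    }
    return JSON.stringify(policy);
}

const redrivePolicyJson = toRedrivePolicyJson({
    deadLetterTargetArn: 'arn:aws:sqs:eu-west-1:000000000000:demo-dlq', // placeholder ARN
    maxReceiveCount: 5
});
```

The property would then read redrivePolicy: toRedrivePolicyJson({...}), and a misspelt field name becomes a compile error rather than a failed apply.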

String properties that only accept a fixed set of values (e.g. Lambda runtime) are handled no differently from other String values. The CDK would have benefitted from using enum types to help improve readability and reduce mistakes.
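
Until the library offers something like this, a project could define its own restricted type. The sketch below is one possible shape, assuming a deliberately short list of runtime identifiers; LambdaRuntime, FunctionSettings and isLambdaRuntime are hypothetical names.

```typescript
// A small, deliberately incomplete list of Lambda runtime identifiers.
const LAMBDA_RUNTIMES = ['nodejs12.x', 'nodejs14.x', 'python3.8'] as const;
type LambdaRuntime = typeof LAMBDA_RUNTIMES[number];

// Hypothetical settings type: the compiler rejects a misspelt runtime here.
interface FunctionSettings {
    handler: string;
    runtime: LambdaRuntime;
}

// Guard for values that arrive as plain strings (e.g. from a config file).
function isLambdaRuntime(value: string): value is LambdaRuntime {
    return (LAMBDA_RUNTIMES as readonly string[]).indexOf(value) !== -1;
}

const functionSettings: FunctionSettings = { handler: 'index.handler', runtime: 'nodejs14.x' };
```

The string passed on to the CDK is unchanged; only the set of values the compiler will accept is narrowed.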

At times writing the CDK code felt long-winded and similar to writing Terraform HCL. It would benefit from more abstraction of the AWS concepts. For example, an addPolicy() function on the IamRole class rather than creating an IamRolePolicyAttachment object with the ARNs of the Policy and Role.

AWS CDK

1. Ease of use

AWS CDK has dedicated language-specific documentation, which makes it very easy to get started and get building. The wider community is much bigger, with more blog posts and opinions on implementation.

Each service supported by AWS CDK is installed as a separate NPM package (e.g. AWS Lambda). These each have their own documentation and independent versioning. However, during development an occasional minor version incompatibility occurred, so it is advisable to keep these package versions aligned.

Underneath, AWS CDK is powered by CloudFormation, which has two benefits. The first is that the change set is always compared with the source of truth (AWS). Second, changes can be viewed through the CloudFormation console in AWS, which can help diagnose problems.

2. Ease of integration into build and deployment pipeline

The AWS CDK NPM module contains everything required to run the AWS CDK CLI. It can be installed from the project's package.json. As such, there is no need for a dedicated Docker container image for running in Jenkins.

Access to resources in AWS is controlled in one of two ways. First, like Terraform CDK, a policy can be assigned to the role that is executing the AWS CDK command. Second, an independent role can be created for executing a particular stack. In this case the Jenkins role just needs access to CloudFormation. The execution role ARN is passed in via the CDK command line call (see below).

cdk deploy --require-approval never --role-arn arn:aws:iam::9888885:role/my-aws-cdk

Getting AWS CDK running on Jenkins was straightforward and it was working in less than an hour.

3. Sharing of common components

Like Terraform CDK, the AWS CDK shares the concept of a scope that is passed from Construct to Construct.

export class NodeSqsLambda extends cdk.Construct {
    constructor(scope: cdk.Construct, id: string, configuration: SqsLambdaConfiguration) {
        super(scope, id);

        const lambdaFunction = new lambda.Function(this, id, {..});
    }
}

AWS CDK encourages extending cdk.Construct for your own code modules. This means custom components are consumed and interacted with through the same mechanisms as those provided by AWS. Like Terraform CDK, these modules could be easily shared via an NPM repository.

The splitting of constructs into separate NPM packages results in custom modules requiring less node_modules baggage. It also makes their dependencies more visible and deliberate.

4. Testing

AWS CDK provides inbuilt assertions and snapshot testing capabilities. The assertion framework works by comparing against the YAML CloudFormation output. The ambiguity of the generated CloudFormation makes test-driving the code difficult. This is especially difficult when there are references to other constructs by ARN. These are often rendered in a confusing manner (see deadLetterTargetArn below).

developmentservicessnsslackawsdevelopmentservicessnsslackawsinboundC75A12B7:
  Type: AWS::SQS::Queue
  Properties:
    QueueName: development-services-sns-slack-aws-inbound
    RedrivePolicy:
      deadLetterTargetArn:
        Fn::GetAtt:
          - developmentservicessnsslackawsdevelopmentservicessnsslackawsdlq3A3F2AC5
          - Arn
      maxReceiveCount: 5
    VisibilityTimeout: 60
5. Code cleanliness

The Construct objects in AWS CDK are all strongly typed. They also have enums and other types to help provide correct values. For example, a Function has a runtime property of type Runtime. This Runtime type exposes the accepted values as named constants, much like an enum (see below). As a result, the code is clear and there is a shorter feedback loop for discovering invalid values, because a misspelling is caught without having to apply against AWS.

const lambdaFunction = new lambda.Function(this, id, {
    runtime: lambda.Runtime.NODEJS_14_X,
    ...
});

There are objects and types for most parts of an infrastructure definition. For example, a Policy is defined using PolicyStatement objects.

const ssmPolicy = new Policy(scope, `${roleName}-ssm-policy`, {
    statements: [
        new PolicyStatement({
            actions: [
                'ssm:GetParametersByPath',
                'ssm:GetParameters',
                'ssm:GetParameter'
            ],
            effect: Effect.ALLOW,
            resources: [
                'arn:aws:ssm:eu-west-1:324234234:parameter/slack/token',
                'arn:aws:ssm:eu-west-1:324234234:parameter/slack/signing-secret'
            ]
        })
    ]
});

AWS CDK also has abstractions in place to create links between resources. Usually, linking a Lambda to SQS requires creation of an EventSourceMapping that references the two ARNs. In AWS CDK the same can be achieved by calling the addEventSource function on the Lambda Function object.

lambdaFunction.addEventSource(new SqsEventSource(queue));

Finally, as mentioned AWS CDK encourages any custom components to extend cdk.Construct. This results in a more unified look and feel of the codebase.

Other thoughts

An advantage that Terraform CDK has is that it also supports other cloud and infrastructure providers. If your organisation has multiple cloud partners and a strong desire to use the same tooling then Terraform CDK is probably for you.

However, the Terraform CDK and AWS CDK are similar enough in their concepts that switching between the two is not jarring. There are far more differences between the AWS and Azure Terraform CDK components. Other types of infrastructure can also be managed through alternative means. For example, Fastly can be managed through TypeScript by using an NPM package.

Conclusion

AWS CDK is the more mature and fully featured of the two CDKs. The high standard of documentation made it far easier to use. Writing code was cleaner and more concise. It was straightforward to get working on Jenkins and had more flexibility in choice of security models.

Both CDKs supported modularisation and distribution of packages via NPM. Though AWS CDK did encourage a more consistent approach to writing custom modules. Unit testing is also now supported in both of the CDKs.

Even if you work in an organisation that uses different cloud providers it may be worth considering using AWS CDK. The two CDKs are similar and it would not be too jarring to move between them (e.g. AWS CDK for AWS and Terraform CDK for Azure).

If you are writing infrastructure-as-code for AWS then the AWS CDK is currently the better of the two. But, if you are currently using Terraform HCL or CloudFormation you will find either of these CDKs a significant improvement.

Release and Deployment

Wed, 08 Sep 2021 09:00:00 +0000

First a question,

What do you want your software release and deployment process to be like?

If you had no constraints from your organisation what would it look like?

There are two predominant factors around release and deployment:

Automation - How much do you want to automate? At one end of the scale is a manual release process with no automation. At the opposite end is everything automated with few (if any) manual gates. This would involve automation of deployment, testing, security scans, automatic rollbacks etc.
Trust - There are two elements of trust. The first is how much the organisation trusts the product teams. Do they have change approval boards and release windows? Or do they trust the team to deploy whenever they choose? The second is how much trust there is within the team. Is the team confident that everyone will conform to standards and guidelines? Or is trust low enough that a more formal process is required?

Using these two factors we can make a pair of axes for comparing different approaches to software release and deployment.

Looking at these axes, can you pinpoint where your attitude to release and deployment lies? Do you have a preference towards more manual gates (lower trust)? And how much do you expect to happen without human intervention (automation level)?

Onto these axes I have placed well-known techniques for release and deployment of software.

Let me explain my reasoning behind these placements.

Change Approval Board (CAB) and Manual Deploy

In the bottom corner is a situation with CABs and little done to automate the deployment process. The need for approval indicates an expectation that the team will break the product, which demonstrates a low level of trust in the abilities and professionalism of the team.

The throttling of releases by this approval discourages the team from automating. Control of access to environments may even prevent the running of automation.

Pull Requests (PR) on Feature Branches

This is often a situation where there is more automation. For example, there is often a build server that runs when integration occurs. Trust appears higher within the organisation as auditable PR reduces need for wider approval. Yet, the enforced PR itself demonstrates a lower level of trust within the team itself. Which is exactly the intent of PRs.

In an Open Source project you don’t know who is working on your project. Having an enforced review process might be a good idea, especially if you don’t want to spread malware. The downside to this is a culture of lower trust. This can lead to dominant people becoming the arbiters of what is ‘good’ for the codebase. Furthermore, having system barriers that prevent pushing to mainline impacts trust. This results in lower empowerment and frustration (e.g. needing to create a branch and PR for a README change).

Continuous Integration (CI)

Martin Fowler’s post about CI is well worth a read as it clears up many misconceptions about CI. A common misconception is that you are doing CI if you have a build server.

Continually Integrating involves the team regularly pushing/merging code to mainline. Thus, it is essential to have automation around build and running of tests. Furthermore, it also demonstrates a higher level of trust within the team. As they are trusting that any agreed practices (e.g. peer review, standards) are happening.

Yet, there is not absolute trust as gates remain to testing and/or production. This could originate from the organisation, team or a combination of both. These gates also make it unlikely that the team will pursue further automation. Such as full deployment pipelines, security scans and in-live testing. It is also feasible that the deployment to other environments could still be manual.

Continuous Delivery (CD)

CD is the ability to release to production at any time. An interesting facet of this is that mainline has to constantly be in a releasable state. This requires higher trust within the team and organisation, as both would need reassuring that releases will not cause impact to customers. Furthermore, the move to CD will also increase the level of automation. This could include: full deployment pipelines, security scanning and monitoring of the production environment.

Continuous Deployment (CDP)

In the top corner in the zone of high trust and automation is CDP. Every code push/merge to mainline is deployed automatically to production. Reaching this point requires a high level of organisation and team trust. There is no gate to deployment and engineers are trusted to avoid customer outages.

Accompanying this is an automate-everything approach to software release and deployment, which may include in-live testing, service observability and automatic rollbacks when deployments fail.

Conclusion

The approach to software release and deployment is as much about trust as it is about automation. The level of trust can be determined at both the organisation and team level, as it is possible to have an organisation with absolute trust but a team who feel they need to impose gates.

The post also ran through where many software release and deployment techniques lie on the axes of trust and automation. This can be used to help understand why a particular technique has been chosen, or to determine an appropriate strategy for release and deployment.

Time Tracking

Tue, 27 Jul 2021 09:00:00 +0000

Time tracking is a very emotive subject, with even the mention of the words likely to cause offence or an immediate defensive stance. This is hardly surprising when many people’s experience of time tracking is either a corporate Orwellian nightmare or a misguided attempt to attribute cost to projects or initiatives.

Many years ago when I worked in a call centre they had a tracking system built into the phones with a series of codes to enter for reasons that included breaks and visiting the toilet. It’s not clear what they were trying to measure with the latter, perhaps they were interested in who had the weakest bladder? It was all very creepy, not helped by the knowledge that a supervisor spent most of the time watching the real time statuses of all the staff and checked on people who had been on extended toilet breaks. Unfortunately for many sectors this has only got worse and worryingly the supervisors are being replaced with AI.

In spite of my bad experiences of time tracking, I’ve recently decided to start tracking my own time within work. The motivation for this is that in a technical leadership role my own time becomes subjected to the System Trap of Tragedy of the Commons. I have multiple people, initiatives, aims, requests all fighting for my attention.

For a while I have been using an Eisenhower Matrix to help determine where I should spend my time, but the missing data is how much of my time I am spending on this versus other day-to-day activities. This is where I believe the time tracking data would be useful. There is also a possibility of using this data to educate those who are fighting for my time, as visualising and educating is a way out of this System Trap.

My Minimum Viable Product (MVP) for time capture is using a Google Sheet to record the time data and the various activity ‘tags’ that I wish to record time against. With this data recorded it is straightforward to build visualisations of the information.

The image above shows the visualisation that I use in my time tracker. For a selected start date and number of days it will generate a log of the activities and a pie chart of the percentage of each activity tag.
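
The calculation behind that pie chart can be sketched in a few lines. This is not how the spreadsheet itself is built; TimeEntry, its field names and the sample rows are assumptions made for illustration.

```typescript
// One row of the time log; field names are assumptions, not the spreadsheet's columns.
interface TimeEntry {
    date: string;    // ISO date, e.g. '2021-07-26'
    tag: string;     // activity tag
    minutes: number; // time spent
}

// Percentage of total time per tag for entries dated within `days` days from `startDate`.
function percentageByTag(entries: TimeEntry[], startDate: string, days: number): { [tag: string]: number } {
    const start = Date.parse(startDate);
    const end = start + days * 24 * 60 * 60 * 1000;
    const inWindow = entries.filter(entry => {
        const when = Date.parse(entry.date);
        return when >= start && when < end;
    });
    const total = inWindow.reduce((sum, entry) => sum + entry.minutes, 0);
    const shares: { [tag: string]: number } = {};
    if (total === 0) {
        return shares; // nothing recorded in the window
    }
    for (const entry of inWindow) {
        shares[entry.tag] = (shares[entry.tag] || 0) + (entry.minutes / total) * 100;
    }
    return shares;
}

// Made-up sample data.
const sampleTimeLog: TimeEntry[] = [
    { date: '2021-07-26', tag: 'mentoring', minutes: 60 },
    { date: '2021-07-26', tag: 'meetings', minutes: 120 },
    { date: '2021-07-27', tag: 'mentoring', minutes: 60 },
    { date: '2021-08-10', tag: 'meetings', minutes: 300 } // outside the selected week
];

const weeklyShare = percentageByTag(sampleTimeLog, '2021-07-26', 7);
```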

What have I learned so far? I’ve noticed that there is an element of Observer Effect occurring. The fact that I am now recording what I’m doing has caused me to think more about how I am using my time. It certainly seems like I am procrastinating less and being more selective about which activities I am doing.

It is only an experiment at the moment and it will be interesting to see whether I stick to it and whether it proves to be useful information (is the value of the information worth the measurement?). Perhaps I should also be mindful that I never get to the level of adding toilet breaks!

If you’d like to try this out yourself there is a public copy of my spreadsheet that you can take a copy of here.

Alerting, Monitoring, and Diagnostics

Wed, 04 Nov 2020 09:00:00 +0000

Inevitably your software will fail.

Requirements around detection and diagnosing of software problems are often neglected. At best they are the last thing considered and are reactionary. Hence, solutions can be muddled and confusing; often unclear about which problem they are trying to address.

Furthermore, the direction is often dictated by the company’s choice of Dashboard/Alerting tooling, rather than engineering having the time and freedom to consider what they need.

In this post, I will run through the differences between Alerting, Monitoring, and Diagnostics of software systems looking at good practices and potential pitfalls.

Alerting

“It’s a dangerous world, man. You’ve gotta keep your eyes and ears peeled and be on the alert all the time.”

Vinnie Paul

Imagine, it’s half past six on a Saturday morning and a critical software system starts to deteriorate, by nine this has escalated into a full outage. However, also consider another less critical system that starts to deteriorate but continues to run.

The purpose of alerting is to give appropriate notifications to those who support software systems. Or to put it another way, what do you want to know about in the middle of the night?

Consequently, having levels of alerting is essential if you don’t want to be woken up for something mundane. A typical example of alert levels would be low, medium, and high. With only the latter triggering the nocturnal surprise.

Furthermore, it is also important to be able to send alert messages to multiple destinations. This ‘fanning’ of messages allows the alert to be sent to multiple mediums (e.g. Slack, email, SMS) and potentially to different groups of people (e.g. engineers, service support).

The diagram below illustrates a simple but effective alerting system that I have seen in use for a product hosted on AWS. The definition of the alerting alarms is handled in CloudWatch. These can then be assigned a level by attaching the appropriate SNS topic (High, Medium or Low). These topics send the alarm out to any number of interested subscribers by various mechanisms (e.g. email, Slack). Splitting topics by alerting level enables the subscribers to choose the level of alerting ‘noise’ they desire.
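
The topic-per-level idea can be modelled in memory to show how subscribers pick their own noise level. The sketch below is only an illustration of the routing; AlertTopics and the subscriber names are hypothetical, and it does not call CloudWatch or SNS.

```typescript
type AlertLevel = 'low' | 'medium' | 'high';

interface AlertMessage {
    level: AlertLevel;
    summary: string;
}

type AlertSubscriber = (message: AlertMessage) => void;

// One list of subscribers per level, mirroring the High/Medium/Low topics.
class AlertTopics {
    private readonly subscribers: Record<AlertLevel, AlertSubscriber[]> = { low: [], medium: [], high: [] };

    subscribe(level: AlertLevel, subscriber: AlertSubscriber): void {
        this.subscribers[level].push(subscriber);
    }

    // Fans the message out to everyone subscribed to its level; returns how many were notified.
    publish(message: AlertMessage): number {
        const targets = this.subscribers[message.level];
        targets.forEach(notify => notify(message));
        return targets.length;
    }
}

const alertTopics = new AlertTopics();
const slackChannel: string[] = [];
const onCallPager: string[] = [];

// The Slack channel listens to every level; the pager only to high alerts.
const everyLevel: AlertLevel[] = ['low', 'medium', 'high'];
everyLevel.forEach(level => alertTopics.subscribe(level, message => { slackChannel.push(message.summary); }));
alertTopics.subscribe('high', message => { onCallPager.push(message.summary); });

alertTopics.publish({ level: 'low', summary: 'disk usage at 70%' });
alertTopics.publish({ level: 'high', summary: 'checkout API is down' });
```

Only the second message reaches the pager, which is the behaviour you want at half past six on a Saturday morning.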

Monitoring

“I know enough of the world now to have almost lost the capacity of being much surprised by anything.”

Charles Dickens, David Copperfield

The central purpose of monitoring is to clearly show if your applications are healthy. I’m referring here to the information you glance at to see if everything is okay. Typically, this is the dashboard you have displayed in the office.

Primarily, this dashboard should be driven by the same alerting process I described in the previous section. Colour can be an effective communication aid for monitoring dashboards. Typically I’ve seen critical alerts use hues of red to draw the observer’s attention. Conversely, greens are used when everything is okay.

Furthermore, these dashboards may also contain contextual information to help the viewer make sense of the alerts. For example, it might be useful to show traffic levels or status of an underlying third party system. However, it is important to remember that less is more with monitoring dashboards. It is not about the detail, it is a call to action to investigate the problem.

The diagram above shows a sketch I’ve made to illustrate what a clear and informative monitoring dashboard might look like. Something is clearly wrong with the Cat API, perhaps this is causing the latency with the Animal Website? Oh wait, the website requests seem to have gone up quite a bit, perhaps that is worth investigating?

It is worth mentioning that many leading tools on the market are not tailored for making monitoring visuals. Products such as Kibana, Grafana, and CloudWatch Dashboards are all based around making dashboards for a defined time frame (such as last 3hrs, last day etc). There’s usually an option to adjust this period in the top corner of the screen. So, why is this a problem?

Firstly, some of the data is pushed real time to the dashboard rather than being queried for a particular time frame. This makes integration with alerting data sources problematic, if not next to impossible with these tools.

Secondly, the context data that you might want to display might not be as simple as a single time period. For example, what if for one graph you need to see the last three hours of data, whereas for a neighbouring graph you want to see the data for the same period yesterday?
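
To make the second point concrete, here is a small sketch of two graphs on one dashboard each computing their own window. TimeWindow, lastHours and samePeriodYesterday are hypothetical names, and "yesterday" is treated as exactly 24 hours earlier, which ignores daylight saving changes.

```typescript
interface TimeWindow {
    from: Date;
    to: Date;
}

const HOUR_MS = 60 * 60 * 1000;

// The window covering the last `hours` hours up to `now`.
function lastHours(now: Date, hours: number): TimeWindow {
    return { from: new Date(now.getTime() - hours * HOUR_MS), to: now };
}

// The same window shifted back by exactly 24 hours.
function samePeriodYesterday(period: TimeWindow): TimeWindow {
    return {
        from: new Date(period.from.getTime() - 24 * HOUR_MS),
        to: new Date(period.to.getTime() - 24 * HOUR_MS)
    };
}

const dashboardNow = new Date('2021-01-02T12:00:00Z'); // fixed time so the example is repeatable
const trafficGraphWindow = lastHours(dashboardNow, 3);
const comparisonGraphWindow = samePeriodYesterday(trafficGraphWindow);
```

A tool with a single dashboard-wide time picker has no natural place for the second window.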

For these reasons, I’ve often seen teams resort to using Open Source tools like Smashing or building their own solutions for clear and informative monitoring.

Diagnostics

“It’s a funny thing about looking for things. If you hunt for a needle in a haystack you don’t find it. If you don’t give a darn whether you ever see the needle or not it runs into you the first time you lean against the stack.”

P.G. Wodehouse

Imagine you’ve had an alert or your monitoring dashboard has turned a shade of rouge. You know that one of your systems is failing, but what is causing it? This is where detail becomes important, you need as much as possible to help you diagnose and resolve the issue. Sources of diagnostic information include application logs, instrumentation metrics from infrastructure, and logs of external services (such as Web Application Firewalls and API Management).

However, be aware that there can be too much detail to sieve through, especially in the middle of the night! It is important to just log and capture information that is likely to be useful. Consider how ‘chatty’ your logs are, will they make it easier or harder to find that needle in the haystack? Another consideration is how long you retain your diagnostic information for, holding logs and metrics longer help spot long-term patterns, but at a higher cost and more data to search through. It is important to find the right balance for your team, not just falling back to default positions (e.g. whole organisation stores logs for 12 months).

Tooling-wise, diagnostics is the forte of products such as Kibana, Grafana, and CloudWatch Dashboards. The ability of these products to bring together numerous data sources and build in-depth visualisations based upon time ranges is ideal for supporting diagnostics.

If your system contains many moving parts (e.g. Microservices) it may be worth considering a Distributed Tracing solution, such as Zipkin or AWS X-Ray. These utilise a common identifier for requests through the system, which enables the user interface of these tools to provide timings and points of failure between the various parts of your system.
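
The common identifier idea can be shown with a toy, in-process sketch. It does not use Zipkin or AWS X-Ray; TraceContext, traced and the recorded span shape are hypothetical, and the service names are borrowed from the dashboard sketch earlier in this post.

```typescript
interface TraceContext {
    traceId: string;
}

interface SpanRecord {
    traceId: string;
    service: string;
    durationMs: number;
    failed: boolean;
}

const recordedSpans: SpanRecord[] = [];

// Runs `work` on behalf of a service and records a span carrying the shared trace id,
// whether the work succeeds or throws.
function traced<T>(context: TraceContext, service: string, work: () => T): T {
    const startedAt = Date.now();
    let failed = true;
    try {
        const result = work();
        failed = false;
        return result;
    } finally {
        recordedSpans.push({ traceId: context.traceId, service, durationMs: Date.now() - startedAt, failed });
    }
}

// Services that failed while handling the request with the given trace id.
function failedServices(spans: SpanRecord[], traceId: string): string[] {
    return spans.filter(span => span.traceId === traceId && span.failed).map(span => span.service);
}

const requestContext: TraceContext = { traceId: 'trace-0001' }; // made-up identifier
const pageResult = traced(requestContext, 'animal-website', () => {
    try {
        return traced<string>(requestContext, 'cat-api', (): string => {
            throw new Error('upstream timeout'); // simulated failure
        });
    } catch (error) {
        return 'page rendered without cat pictures';
    }
});
const failingServices = failedServices(recordedSpans, 'trace-0001');
```

Because both spans carry the same identifier, the failure can be pinned on the Cat API even though the website itself responded.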

Conclusion

In this post I have covered the differences between Alerting, Monitoring, and Diagnostics on software systems. The diagram below illustrates how the amount of detail increases from Alerting to Monitoring and then surges in the Diagnostics arena.

Each of these use cases has unique requirements that call for different approaches and different tool-sets. But combined together they help to form a coherent picture of the state of a software system.

Job Title Diffusion Causal Loop Diagram

Sun, 27 Sep 2020 09:00:00 +0000

In my previous post I covered the phenomenon of Job Title Diffusion. That is, the impact that a variety of job titles can have on an organisation.

Recently, I’ve had to examine this in more detail and found that a Causal Loop Diagram really helped to illustrate and understand the problem (see this post for more details about Causal Loop Diagrams). What I ended up with is the diagram below.

As a refresher, the S links move in the Same direction and the O links move in the Opposite direction. For example, as the variety of job roles in a team increases, the variety of technology and code in the team also increases (S link). Also, as the variety of job roles increases, the variety of work an individual can pick up decreases (O link).

There is a Reinforcement Loop going from variety of job roles to variety of work an individual can pick up to pressure to recruit specialists and back to variety of job roles. This often manifests itself in a vicious circle that drives the number of each type of specialist which reinforces the siloing of job role and variety of technology and code in the team.
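
A common rule of thumb for Causal Loop Diagrams is that a closed loop with an even number of O links is reinforcing, and one with an odd number is balancing. The sketch below applies that rule to the loop just described. Only the first link's polarity is stated in the text; the polarities of the other two links are my assumptions, and CausalLink and loopType are hypothetical names.

```typescript
type LinkPolarity = 'S' | 'O';

interface CausalLink {
    from: string;
    to: string;
    polarity: LinkPolarity;
}

// Classifies a closed loop: an even number of O links reinforces, an odd number balances.
function loopType(links: CausalLink[]): 'reinforcing' | 'balancing' {
    if (links.length === 0) {
        throw new Error('a loop needs at least one link');
    }
    links.forEach((link, index) => {
        const next = links[(index + 1) % links.length];
        if (link.to !== next.from) {
            throw new Error(`links do not form a closed loop at "${link.to}"`);
        }
    });
    const oppositeCount = links.filter(link => link.polarity === 'O').length;
    return oppositeCount % 2 === 0 ? 'reinforcing' : 'balancing';
}

const jobTitleLoop: CausalLink[] = [
    // Stated above: more job roles means less variety of work per individual.
    { from: 'variety of job roles', to: 'variety of work an individual can pick up', polarity: 'O' },
    // Assumed: less variety of work per individual means more pressure to recruit specialists.
    { from: 'variety of work an individual can pick up', to: 'pressure to recruit specialists', polarity: 'O' },
    // Assumed: more pressure to recruit specialists means more job roles.
    { from: 'pressure to recruit specialists', to: 'variety of job roles', polarity: 'S' }
];

const jobTitleLoopType = loopType(jobTitleLoop);
```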

I’ve also recorded that having more individuals who can only pick up certain types of work can cause pressure for more work in-progress which will negatively impact the amount of work delivered (work takes longer to complete and is impacted by workers being unavailable). This in turn creates pressure to recruit more specialists and fuels our Reinforcement Loop.

There are two output dangles from the system. The first of these is the bus factor risk of the team, this is a rather macabre idea of how much the team would be impacted by an individual being hit by a bus (a bus factor of 1 would be only one person knows about this part of the system). Second, is the pressure on team cost which is driven by the pressure on small team size, which in turn is driven by variety of job roles in the team.

Finally, I’ve added the intervention I suggested in my previous post as an input dangle. This is highlighted in yellow and named encourage people to work outside their specialism. This impacts the variety of work that an individual can pick up and starts to turn the Reinforcement Loop into a virtuous, rather than vicious, circle.

Law of Job Title Diffusion

Tue, 01 Sep 2020 09:00:00 +0000

I’ve been trying to put into words a phenomenon that I’ve increasingly seen in software organisations. This is my first attempt in describing what I’ve called the Law of Job Title Diffusion:

An organisation’s number of tools, languages, frameworks and source code repositories increases with every job title introduced.

My hypothesis is that in organisations with lower numbers of engineering job titles there are fewer programming languages, frameworks and tools in play. The source code is also more likely to be concentrated in a smaller number of repositories.

Imagine an engineering team where individuals do have particular talents, but work collaboratively to get work done. Technology decisions will be taken more holistically, such as writing automated testing in the same language used to build the actual solution or using the same JSON parsing library across the board. Thus reducing the number of languages and frameworks in play. Furthermore, the code for solution, tests and infrastructure is more likely to live in the same source control repository.

Conversely, a team with dedicated Software Engineers in Test, Frontend Engineers and Platform Engineers is more likely to have multiple language choices (e.g. Scala for tests, Python for infrastructure, JavaScript for frontend and Java for backend). Moreover, they are also likely to store their code in different places. For example, separate repositories for infrastructure, automated testing and source code.

Is it possible to break this law? Possibly, if job titles are seen as consultative roles rather than specialisms and there is a genuine culture of solving problems together.

Note: Sometimes different problems do require different solutions, so a different language or framework might be the right choice. However, this should be a conscious choice rather than a by-product of having different job titles!

Asynchronous meetings

Thu, 06 Aug 2020 09:00:00 +0000

Asynchronous - not simultaneous or concurrent in time : not synchronous

https://www.merriam-webster.com/dictionary/asynchronous

The current pandemic has suddenly thrust many of us into remote ways of working. This post looks at Asynchronous (Async) Meetings, how to make them effective and how to conduct one in a tool such as Slack/Microsoft Teams.

What is an Async meeting?

An Async meeting doesn’t require all participants to meet and discuss a topic at the same time. Instead, the meeting happens in written form via a shared document or in a channel of a tool such as Slack or Microsoft Teams.

This style of meeting is beneficial to remote working, as it helps reduce the fatigue caused by an excessive amount of time on video calls. Furthermore, it can be advantageous when participants have different working patterns.

Async meetings also give participants the opportunity for longer contemplation of the questions and responses. This should lead to better results and engagement, as it empowers those who are considered in their responses.

What makes an effective Async meeting?

Fortunately, many of the practices of effective meetings apply to Async meetings.

The first of these is to have a clear problem statement, which gives the participants the context and objective of the meeting. This could be a question that needs answering or a decision that needs making.

Secondly, the meeting needs a clear agenda. What are the sub-points that need discussing to reach the objective of the meeting? In Async meetings these work better as a series of direct questions, as this removes ambiguity and discourages meandering off topic.

Thirdly, have clear outcomes and actions from the meeting. As in standard meetings, actions need to be documented, assigned to responsible person(s) and followed up on.

It is also essential to have a chair of the meeting. As with synchronous meetings this role keeps contributors on topic. Additionally, they are responsible for setting the timetable and reminding participants to contribute.

A unique requirement of Async meetings is timeboxes. Without these a conversation could last days, weeks or even months! The chair should set these and ensure that all participants are aware of the deadlines.

Finally, it is important to get commitment from each of the participants to contribute to the meeting. Without this there is a possibility that the meeting will not progress, or that a key stakeholder will dispute the outcome of the meeting.

Running an Async meeting in Slack/Microsoft Teams

Tools such as Slack and Microsoft Teams are a perfect medium for running Async meetings. Here are a few hints on using them effectively.

To begin with, have a dedicated channel for the meeting. This should be well named and be archived once the meeting has concluded. Furthermore, it should be private, to virtually close the door on the meeting, though this shouldn’t discourage you from adding useful participants at any point in the meeting.

Secondly, use threads within the channels for each question or sub-point that is being discussed. This makes it much easier for the participants to navigate and review the meeting.

Next, when chairing a meeting, use tools like polls and reminders to help the participants reach consensus. Reminders are particularly useful for automating the manual chasing that this would otherwise entail. If an appropriate tool is not available on the platform (e.g. a whiteboard) then simply link to it from the channel. Be careful not to move the conversation outside the original channel.

Incidentally, it is also important to remind participants not to constantly check and react to the meeting. Ideally, they should be checking the channel a few times a day rather than being at its beck and call. If the latter happens, the meeting has slipped into being a long-running synchronous conversation and is potentially worse than the video call that you’re trying to avoid!

Finally, ensure that the meeting channel is closed once a meeting has ended. This shuts the door on the meeting and removes it from the participants’ cognitive load. However, it is useful to archive the meeting channel rather than delete it, as it may be needed in the future to understand why a certain decision was made.

Conclusion

Async meetings can be an effective alternative to video calls and potentially even face-to-face meetings. They also give the opportunity for participants to have more time to consider questions, responses and their contributions.

Nevertheless, it is not as simple as creating a Slack channel and inviting a few participants. The meetings require specific agendas and strong chairing.

Causal Loop Diagrams: Ability to Cope

Thu, 18 Jun 2020 09:00:00 +0000

This is my favourite diagram at the moment….

I’ve shared it with countless colleagues, as I feel it really demonstrates the vicious circle that many Software Teams/Departments find themselves in.

It is shamelessly copied and adapted from an example in the book Seeing the Forest For the Trees. The example in the book is based upon a back-office function of a financial organisation, yet it shares many of the traits of departments and teams in Software.

This type of diagram is called a Causal Loop Diagram and is used to show the interactions between parts of a system. Each link between elements is annotated with either an S or O. These inform the reader whether the connected elements move in the same (S) or opposite (O) direction.

What is the diagram above conveying to us? If we start from volume and variety of initiatives, when this increases (as often happens in Software) the workload moves in the same direction (S) - it also increases! Consequently, an increase in workload will cause a decrease in the ability to cope, as this moves in the opposite direction (O).

As a result of this, the incidence of error will increase, as this moves in the opposite direction to the ability to cope. The increased number of errors will cause the strain on management to move in the same direction (increase). Finally, this strain will cause the workload to move in the same direction, thus increasing it further. This completes the vicious circle which, if left unchecked, will continue to deteriorate.

There is also additional information about how staff levels, cost and delays impact this loop externally. Furthermore, the diagram shows how service quality is impacted by this vicious circle.

The loop that is causing the vicious circle is an example of a reinforcement loop. These types of loop will continue to spin until some action is taken. In this example the influencing factor is the volume and variety of initiatives.

All reinforcement loops can create vicious or virtuous circles; which one depends on the influencing factor(s). In our example above, if the volume and variety of initiatives is at a more manageable level it will decrease the workload. This in turn will increase the ability to cope, which will reduce the incidence of error. For this reason the strain on management will be reduced, which then leads to a further reduction in workload. We now have a virtuous circle!

Hence, this causal loop diagram illustrates that the flow of work into teams is the most important lever to control both the ability of team members to cope and the sustainability of delivery. This is probably not a surprise to many software teams, but this diagram and causal loop diagrams in general are a great way of illustrating systemic traps that organisations get themselves into.
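The walk around the loop described above can also be sketched in a few lines of TypeScript. This is only a minimal sketch under stated assumptions: the type names, the `propagate` and `loopType` functions and the list of links are illustrative and are not taken from the book. The term "balancing" for a loop with an odd number of O links comes from general systems-thinking usage rather than from this post.

```typescript
// Link polarity: "S" = elements move in the same direction, "O" = opposite.
type Polarity = "S" | "O";
type Direction = "increase" | "decrease";

interface Link {
  from: string;
  to: string;
  polarity: Polarity;
}

// The loop from the post, written as an ordered list of links (illustrative model).
const abilityToCopeLoop: Link[] = [
  { from: "workload", to: "ability to cope", polarity: "O" },
  { from: "ability to cope", to: "incidence of error", polarity: "O" },
  { from: "incidence of error", to: "strain on management", polarity: "S" },
  { from: "strain on management", to: "workload", polarity: "S" },
];

const flip = (d: Direction): Direction =>
  d === "increase" ? "decrease" : "increase";

// Follow a change once around the loop, recording the effect on each element.
function propagate(loop: Link[], initial: Direction): Array<[string, Direction]> {
  const effects: Array<[string, Direction]> = [];
  let current = initial;
  for (const link of loop) {
    // An "S" link passes the change on unchanged, an "O" link reverses it.
    current = link.polarity === "S" ? current : flip(current);
    effects.push([link.to, current]);
  }
  return effects;
}

// An even number of "O" links returns a change in its original direction,
// so the loop reinforces itself; an odd number would counteract the change.
function loopType(loop: Link[]): "reinforcing" | "balancing" {
  const opposites = loop.filter((link) => link.polarity === "O").length;
  return opposites % 2 === 0 ? "reinforcing" : "balancing";
}
```

Calling `propagate(abilityToCopeLoop, "increase")` follows the vicious circle and ends with the workload increasing again, while starting from `"decrease"` follows the virtuous circle and ends with the workload decreasing. In both cases `loopType(abilityToCopeLoop)` reports a reinforcing loop, because the two O links cancel each other out.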

A Year in Retrospective

Thu, 14 May 2020 09:00:00 +0000

Recently, I completed a year of holding weekly personal retrospectives. In this post I’ll look at how I run my retrospectives, what has gone well, what hasn’t worked and finally why I’ve managed to keep to this routine for a year.

My weekly personal retrospective takes the following format:

Meditation
How am I?
Feedback
What has gone well?
What can I improve?
Review last week’s goals
Set the week’s goals

Initially, I perform 10 minutes of mindfulness meditation to clear my mind ready to reflect. This is followed by asking myself ‘how am I?’, which is aimed at discovering how I’m feeling mentally. Am I anxious, tired, happy, sad?

Next I focus upon the feedback I’ve received in the past week. I record any positive and constructive feedback and spend a moment reflecting upon it. This feeds into the next two sections, ‘what has gone well’ and ‘what can I improve’. Hopefully these parts are fairly self-explanatory, but it is worth stating that I focus on all aspects of my life, not just work. For example, last week I included teaching my 5-year-old son how to ride a bike without stabilisers.

Subsequently I review the goals I set myself in the previous retrospective, recording whether they have been met, worked towards or neglected. Finally, this leads to setting the goals for the next week. These could be goals from the previous week that are still outstanding or goals based on improvements that I previously identified. I always ensure a Work In Progress (WIP) limit of two for weekly goals, as any more is counterproductive. Having a WIP limit of two allows me to have one goal inside and one outside work.

What has gone well with my year of weekly retrospectives? To begin with, the recording of positive feedback and what has gone well has helped lift my spirits, especially when I feel like I’ve had a bad week^.

Furthermore, I have found that having targeted goals has helped me complete long-term projects, such as building a bike for my father and various work initiatives.

Conversely, what hasn’t worked with my personal retrospectives? Initially I tried to run my retro at 15:30 on a Friday, but I found that I often deferred it to Monday morning. This was due to work commitments or just generally winding down for the weekend. Therefore I have now shifted the retro to always be first thing on a Monday.

Secondly, I experimented for a period with using data from Todoist to reflect on where I was spending my time. I’ve been using Todoist for a number of years as a task list for Getting Things Done (GTD). As the tasks are categorised by type of work, I hypothesised that this data would give me a real indication of where my efforts were going. It quickly became apparent that the data was unreliable, as many things I did just weren’t recorded in Todoist. Therefore I chose to abandon this part of the retrospective.

To conclude this post I want to reflect upon why I have managed to keep my personal retrospectives going for a year. My previous attempts lasted mere weeks, abandoned due to neglect or simply being too busy. So, what has been different this time?

From the beginning, I have made the retrospective part of my routine. There is a calendar entry for the event itself. But, more importantly, I have a check-in every morning to review my current goals. Consequently, I am far more likely to work towards and complete these goals, which in turn creates a positive reinforcement loop for continued use of a personal retrospective.

Incidentally, by publishing this blog post I can mark one of my goals for this week as done!

^ I have recently been introduced to the concept of Gratitude Journalling, which also highlights the virtues of positive reflection upon mental health.

Acceptance Testing

Wed, 31 Jul 2019 10:39:51 +0000

Last week I had the following DM on Twitter from @rahabm about the current state of Acceptance Testing:

Hi Iain. I remember that testing used to be taken serious back in the UK. What is the current state on Acceptance Testing these days ? What are your thoughts on who should write and implement an acceptance test ? 😀

I commented at the end of my rant back to @rahabm that I should probably write a blog post about this, so here it is!

To begin with, what do we mean by Acceptance Test? If we use Mike Cohn’s Testing Pyramid we are referring to the User Interface or system boundary test, which is the very top of the pyramid. For a website this would be a form of browser automation, with code performing the clicks and data entry normally performed by humans.

As the Testing Pyramid suggests, we should write as few of these as possible. They are time consuming to write, maintain and run, and have a much slower feedback loop than unit tests. If there are too many of these they can impact a team’s cadence of delivery and discourage continuous integration.

Acceptance Tests should focus on critical user journeys rather than specific functionality. For example, for an e-commerce site you would expect there to be an Acceptance Test that selects a number of items and completes payment. They are also usually focussed upon positive outcomes, rather than failure cases such as ‘xyz failed’.

Who should write Acceptance Tests?

Some organisations have dedicated people, sometimes called Software Engineers in Test, to write Acceptance Tests. I would suggest not doing this as:

It will encourage more of these tests to be written, as it helps to occupy this person’s time.
Often this person will work by themselves, becoming a bottleneck and a bus factor of one.
They may follow different coding practices and/or languages than the rest of the engineering team. I have seen entire test suites thrown away when the test engineer leaves the team for this reason.

A preferable solution is to employ a QA who thinks about quality in the system holistically and works with the engineering team to work out how to test it better. This person is NOT primarily a coder, but is happy to pair or join a mob with engineers to ensure quality.
