Thursday, November 1, 2018

DevOps: pitfalls of manual deployments

As part of my quest for continual learning, I want to go deep into the theory and the why of DevOps, not just learn the tools and buzzwords surrounding the culture. To that end I've begun reading Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation by Jez Humble and David Farley. The authors open the book by discussing the deployment pipeline, which is the book's central theme.

Deployment Pipeline: an automated implementation of your application's build, deploy, test, and release process; benefits include:

  1. Makes every part of the process visible to everyone involved, regardless of silo.
  2. Improves feedback process, so that problems are identified and rectified more quickly.
  3. Enables teams to deploy any version of their software to any environment, in a predictable, repeatable, that is to say automated, fashion.

Humble and Farley go on to describe three common antipatterns of the deployment pipeline: deploying software manually, deploying to a production-like environment only after development is complete, and the manual configuration of production environments.

The issues raised with manual deployments struck very close to home, as many of the underlying issues with manual software deployments plague traditional operations deployments as well. In fact, the first VAR / MSP I worked for had engineers perform manual deployments exclusively for every project (i.e. physical hosts, virtual servers, firewalls, access points, switches, etc.). As a result, almost every customer's environment was different, and not by purposeful design.

An example of this was a client with multiple locations, where every site was different: local subnets routed at the firewall in one place, at the switch in the next, on the ISP's modem in another, and at the wireless controller (with a default route to the firewall) in a fourth. Less dramatic examples include spending hours tracking down network loops, because each switch was either not running Spanning Tree or running a different version of it. These examples may seem ludicrous to an established enterprise IT team, but the underlying issue of non-repeatable, non-auditable deployments that keep talented engineers tied up with repetitive, boring tasks is likely very relatable.

Thursday, October 18, 2018

networking: pcaps tell the whole story

Working for an MSP, I have the opportunity to interface with a large number of different clients and vendors. A few months ago a vendor contacted me on behalf of a mutual client, stating they had "updated their scripts" and now their application wasn't working for our mutual client. They had tested in their (the vendor's) test environment and determined the client's firewall (for which I was responsible) must be blocking the traffic.

I tested with the client and saw the traffic being allowed through, so I informed the client I'd work directly with the vendor and provide them with updates.

During testing with the vendor, I could see two-way traffic being allowed through the firewall; however, the vendor reported they weren't receiving any traffic from the client.

This caused me great confusion, so I requested we run a packet capture on both sides to compare. That quickly allowed me to determine that traffic was being sent and received on both sides. Not only that, all traffic was making it through in a timely manner; yet the SFTP uploads and webpage calls were still failing.

With the pcaps confirming that bi-directional traffic was allowed, the vendor agreed to dig into the pcaps with me. We quickly determined the client's server was attempting to negotiate TLS 1.0, which had been deprecated on the vendor's servers in favor of TLS 1.2 as part of the "script updates".

A quick installation of the TLS 1.2 libraries on the client's servers resolved the issue. This experience has made me much quicker to run packet captures to check for obvious issues, rather than putting them off as a last resort!
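For the curious, here's what we were actually looking at in the pcaps. This is a minimal Python sketch, with hand-built example byte strings (not from the real captures), that pulls the offered TLS version out of a raw ClientHello record:

```python
# Map the two-byte version field in a TLS handshake to a friendly name.
TLS_VERSIONS = {
    (3, 1): "TLS 1.0",
    (3, 2): "TLS 1.1",
    (3, 3): "TLS 1.2",
}

def client_hello_version(record: bytes) -> str:
    """Return the TLS version offered in a raw ClientHello record."""
    assert record[0] == 0x16, "not a TLS handshake record"
    assert record[5] == 0x01, "not a ClientHello message"
    # record header (5 bytes) + handshake type (1) + handshake length (3),
    # so the client_version sits at offsets 9-10
    major, minor = record[9], record[10]
    return TLS_VERSIONS.get((major, minor), f"unknown ({major}.{minor})")

# Hypothetical fragments: record-layer header, then ClientHello header + version.
old_client = bytes([0x16, 0x03, 0x01, 0x00, 0x06, 0x01, 0x00, 0x00, 0x02, 0x03, 0x01])
new_client = bytes([0x16, 0x03, 0x01, 0x00, 0x06, 0x01, 0x00, 0x00, 0x02, 0x03, 0x03])

print(client_hello_version(old_client))  # → TLS 1.0 (what the client's server offered)
print(client_hello_version(new_client))  # → TLS 1.2 (what the vendor now required)
```

In practice a tool like Wireshark decodes this field for you; the point is that the client advertises its best TLS version in the very first handshake packet, which is why the mismatch was plainly visible in the captures.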

Monday, October 15, 2018

networking: Class of Service

In part 1, I stated that CoS doesn't refer to just layer 2 tagging of frames, as is commonly believed. Rather, CoS is the implementation of QoS principles enacted at various layers of the OSI model.

CoS facilitates the prioritization of traffic flows over a common path.
  • a means to recognize and control different types of traffic
  • ability for application traffic to be considered more or less important
  • mechanism to manage congestion of traffic
IEEE 802.1p/Q at the Ethernet layer and DSCP at the IP layer are some of the most commonly utilized standards-based CoS mechanisms.

Layer 2 method of CoS: 802.1p/Q Priority Code Point

  • 3-bit field in the 802.1q tag, with a value between 0-7, used to differentiate / give priority to certain Ethernet traffic.
  • When configuring lldp med, setting the "priority" or PCP value to 5 sets the PCP bits to 101, giving those Ethernet frames high, voice-class priority (values 6 and 7 are higher still, but are conventionally reserved for network control traffic).
  • Because 802.1p/Q is a Layer 2 (Ethernet) standard, it only applies to the Ethernet header. At every Layer 3 boundary (router hop), the Layer 2 header, including the PCP bits, is stripped and replaced with a new header for the next link. Thus, 802.1Q doesn’t guarantee end-to-end QoS.
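To make the bit layout concrete, here's a minimal Python sketch (just bit arithmetic, not from any vendor tooling) of how a PCP of 5 lands in the 16-bit Tag Control Information field of the 802.1Q tag, using the VLAN 30 / priority 5 values from the Brocade example further down:

```python
# Sketch of the 16-bit 802.1Q Tag Control Information (TCI) field:
# PCP (3 bits) | DEI (1 bit) | VLAN ID (12 bits)
def build_tci(pcp: int, dei: int, vlan_id: int) -> int:
    assert 0 <= pcp <= 7 and dei in (0, 1) and 0 <= vlan_id <= 4095
    return (pcp << 13) | (dei << 12) | vlan_id

# priority 5 on VLAN 30, matching the lldp med config example
tci = build_tci(pcp=5, dei=0, vlan_id=30)
print(f"{tci >> 13:03b}")   # PCP bits → 101
print(hex(tci))             # full TCI field → 0xa01e
```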

Layer 3 method of CoS: DSCP - Differentiated Services or DiffServ

  • 6-bit field in the IP header, with a value between 0-63, used to differentiate / give priority to certain IP traffic.
  • When configuring lldp med, setting the DSCP value to 46 sets the DSCP field in the IP header to 101110 (the Expedited Forwarding per-hop behavior), and datagrams with this marking are given the priority treatment typically used for voice traffic.
  • Network devices MUST be configured to honor existing CoS values, or the markings may be overwritten.
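To sanity-check the bit values above, here's a quick Python snippet (just bit arithmetic, nothing vendor-specific) showing how DSCP 46 becomes the 101110 flag and where it sits in the IP header:

```python
# DSCP occupies the upper 6 bits of the IP header's DS (former ToS) byte;
# the lower 2 bits are used for ECN.
dscp = 46                   # Expedited Forwarding, per the lldp med example
assert 0 <= dscp <= 63      # DSCP is a 6-bit field

print(f"{dscp:06b}")        # DSCP flag bits → 101110
ds_byte = dscp << 2         # shift left past the 2 ECN bits
print(hex(ds_byte))         # DS byte as it appears on the wire → 0xb8
```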

An example configuration string from a Brocade / Ruckus switch:

lldp med network-policy application voice tagged vlan 30 priority 5 dscp 46 ports ethe 1/1/1 to 1/1/48

Part 1, 2


Tuesday, October 9, 2018

powershell: check service status and health

Had a unique situation recently with a client whose IMAP Proxy would go offline after a reboot. All the services were running, and the Event Logs were less than helpful in determining the root cause. So while I wait for Microsoft support to investigate and resolve the root cause (haha), I wrote a basic script to prevent avoidable outages due to this bug.

The script is written to import the Exchange Management Module.

Then check the health and status of the IMAP Proxy.

If the health and status of the IMAP Proxy is NOT online and healthy, it triggers a condition to start the service.

If the health and status of the IMAP Proxy IS online and healthy, it exits the script.

This script is scheduled via Task Scheduler to run on startup, after a 15 minute delay (to allow normal Exchange services a chance to start).

<#
    Checks the health & status of the IMAP Proxy service, and starts it if not healthy and online.

    CREATE DATE:    2018-10-09
            v1.0 - Completed script and deployed via Task Scheduler to run on startup, after a 15 minute delay.
#>

# add exchange management module
Add-PSSnapin Microsoft.Exchange.Management.PowerShell.SnapIn

# run against the local Exchange server
$servername = $env:COMPUTERNAME

# check the state of the IMAP Proxy health set
$imapHealth = Get-HealthReport -Identity "$servername" | Where-Object { $_.HealthSet -eq "IMAP.Proxy" }

# start the component if NOT already online; otherwise do nothing and exit
if ($imapHealth.State -ne "Online") {
    Set-ServerComponentState -Identity "$servername" -Component "ImapProxy" -Requester "HealthAPI" -State "Active"
}
link to code on github

Wednesday, October 3, 2018

networking: CoS vs. QoS

CoS - Class of Service
QoS - Quality of Service

NOT a layer 2 vs layer 3 differentiation.
NOT a guarantee for traffic (dependent upon each hop respecting the request).

QoS - Quality of Service: an umbrella term which covers the use of features such as traffic policing, shaping, and advanced queuing mechanisms.

CoS - Class of Service: a form of QoS applied at layer 2 (ex: PCP) and layer 3 (ex: DSCP).

Part 1, 2


Wednesday, September 26, 2018

docker: replicated vs global services

When in Docker swarm mode, an application image is deployed by creating a service, which runs across the Docker swarm (on worker nodes), rather than as a container on an individual host.

There are two modes a service can be run in: replicated and global.

Replicated mode - a set number of identical containers is created, and that number can be modified via the “--replicas” flag or the “docker service scale” command. The default mode is replicated, and the default number of replicas is 1.

docker service create --name replica-test --replicas 3 nginx

Global mode - creates an identical container on each node in the swarm; this number cannot be modified (the service can only be removed entirely).

docker service create --name global-test --mode global httpd

When using replicated mode, you declare a desired service state by creating or updating a service, the orchestrator realizes the desired state by scheduling tasks. For instance, you define a service that instructs the orchestrator to keep three instances of an HTTP listener running at all times. The orchestrator responds by creating three tasks. Each task is a slot that the scheduler fills by spawning a container. The container is the instantiation of the task. If an HTTP listener task subsequently fails its health check or crashes, the orchestrator creates a new replica task that spawns a new container. (source)

When using global mode, you declare the desired service state by creating a global service. The orchestrator then creates a task and schedules it for every node in the swarm; there is no way to define how many containers are created or which nodes they are created on, as it runs a single instance on each node. If a node is added to the swarm, the orchestrator creates and schedules a task for the global service on the new node. Common use cases include monitoring agents and security applications (e.g. AV software).

The diagram below is from the Docker docs, and shows our nginx three-replica service in yellow and our apache (httpd) global service in gray.

Tuesday, September 25, 2018

AWS Certified Solutions Architect

I started digging into AWS, Amazon Web Services, around March of this year. At first, it was just to understand their offerings and terminology, so I could speak intelligently on the subject with my peers.

And then, as I was fiddling around with their services, it dawned on me how powerful the tools were. I could spin up servers with a few clicks, easily monitor those instances with CloudWatch, get pricing alerts with SNS, and was blown away when I deployed my entire testing environment with Elastic Beanstalk.

And so my "fiddling around" intensified, and I wound up with 18 pages of notes, dozens of hours of labbing, at least a hundred hours of study, and 45 minutes of test taking; on the first attempt, I am an AWS Certified Solutions Architect - Associate.

Resources Used:

Linux Academy

AWS CSAA Official Study Guide

AWS Documentation

Friday, June 29, 2018

the makings of a website

Three days ago I didn't own a domain name.

However, while registering this domain, Google alerted me to a substantial free credit for GCP (Google Cloud Platform). I decided to take the free credit and poke around.

It took all of about 30 minutes to spin up an Nginx instance, obtain a public IP, set up firewall rules, and update DNS.

With help from a tutorial, I set up SSL encryption from Let's Encrypt and automated the renewal process.

After about 15 minutes of writing the world's most basic HTML page, I was left feeling very fulfilled with the results:

Takeaways:

Google's $300 credit for GCP will allow me to run my current instance for 2 years before it costs me anything.

The tutorial for obtaining an SSL certificate from Let's Encrypt was thorough. Be warned: it requires some level of experience with Linux. You'll need to be familiar with navigating Linux, mkdir, vim/nano, chmod, and chown (at minimum).

If you're going to take the time to obtain a certificate, the little bit of extra effort to automate the renewal is a no-brainer!

Tuesday, June 26, 2018

Hello World!

Thanks for joining me as I journey through my career in IT.

My purpose for starting this blog is threefold:
1) Learning: opportunity to challenge myself to go deep into protocols and learn new technologies.
2) Branding: opportunity to demonstrate to other companies and peers my skills and strengths.
3) Sharing: opportunity to assist others in their journey.

I will strive to be as accurate as possible, and when I make a mistake please let me know so we can continue to learn together!