ProLUG Admin Course Unit 3 🐧

Storage & Logical Volume Management 🧠💾

Intro

Okay, Week 3 / Unit 3. This week we are working on Logical Volume Management (LVM) and RAID. Fortunately, I was able to listen to the entire lecture and read most of the chat, which is always helpful.


Lab Notes 🧪

Warmup

Redirects ✉️

cd ~ # Change Directory to Home

mkdir lvm_lab # Create a Directory called lvm_lab

cd lvm_lab # Change location into the lvm_lab directory

touch somefile # create an empty file called somefile

echo "this is a string of text" > somefile # This sends the output of echo into the file

cat somefile # concatenates and displays what is in the file

echo "this is a string of text" > somefile # Overwrites the line with the same text

echo "this is a string of text" > somefile

echo "this is a string of text" > somefile

cat somefile # We are left with one line of text after repeating this action because this action overwrites.

echo "This is a string of text" >> somefile # The double arrow is redirect append

echo "This is a string of text" >> somefile # this adds a second line

echo "This is a string of text" >> somefile # this adds a third line

cat somefile # The output is now the original line plus 3 lines of "This is a string of text"

echo "this is our other test text" >> somefile

echo "this is our other test text" >> somefile

echo "this is our other test text" >> somefile

cat somefile | nl # This adds numbering to each line in the file

cat somefile | nl | grep nothing # This does nothing, as there is no text within the file that contains the string "nothing"

cat somefile | grep nothing | nl # Also nothing

cat somefile | nl | grep other # Gives us our last 3 lines with their original line numbers

Lesson: always | nl | before your grep if you want the original line numbers

Lab 🥼🧪

Disk Speed Tests ⏱️

This is on my own virtual machine, so figuring out the unique commands took a bit of extra work.

lsblk /dev/sda2 # Inspect the partition

fdisk /dev/sda # Interactive partitioning; the single-letter commands below run inside fdisk
p  # Print the partition table
d  # Delete a partition
w  # Write the changes out to the disk

Write tests 💾

saving off write data – rename /tmp/file each time

Checking /dev/sda2 for a filesystem

blkid /dev/sda2 👍

If no filesystem, make one

mkfs.ext4 /dev/sda2

mkdir space

mount /dev/sda2 space # The lab used /dev/xvda; on my VM the partition is /dev/sda2

Write Speed Test 💾🏎️

for i in $(seq 1 10); do time dd if=/dev/zero of=/space/testfile$i bs=1024k count=1000; done 2>&1 | tee -a /tmp/speedtest1basiclvm

Write Test Result

real 0m0.003s
user 0m0.000s
sys 0m0.002s

Read Test 👓💾

for i in $(seq 1 10); do time dd if=/space/testfile$i of=/dev/null; done

Read Test Result

real 0m0.001s
user 0m0.001s
sys 0m0.001s

Cleanup 🧹

for i in $(seq 1 10); do rm -rf /space/testfile$i; done

LVM 🧠💾

Start as root (the # prompt); cd /root

Check physical volumes on your server (my output may vary)

[root@rocky1 ~]# fdisk -l | grep -i sda

output:

Disk /dev/sda: 32 GiB, 34359738368 bytes, 67108864 sectors
/dev/sda1           2048  2099199 2097152   1G 83 Linux
/dev/sda2        2099200 67108863 65009664 31G 8e Linux LVM

Basic LVM listings

pvs

output:

PV              VG                  Fmt    Attr PSize    PFree
/dev/sda2       rl_localhost-live   lvm2   a--  <31.00g  0

vgs

output:

VG                   #PV    #LV     #SN   Attr      VSize   VFree
rl_localhost-live      1      2       0   wz--n-    <31.00g    0

lvs

output:

LV    VG                 Attr       LSize   Pool   Origin   Data%  Meta%  Move  Log Cpy%Sync Convert
root  rl_localhost-live  -wi-ao---- 27.79g
swap  rl_localhost-live  -wi-ao----  3.20g

pvdisplay (needs sudo or root)

output:

PV Name          /dev/sda2
VG Name          rl_localhost-live
PV Size          <31.00 GiB / not usable 3.00 MiB
Allocatable      yes (but full)
PE Size          4.00 MiB
Total PE         7935
Free PE          0
Allocated PE     7935
PV UUID          gMVNd5-peB1-uUX6-Rw28-4Ncb-mi1b-rDJR38
  • PV Name: /dev/sda2

    • This is the physical volume (PV) name, which indicates the device or partition that has been initialized as a physical volume for use in LVM (Logical Volume Manager).
  • VG Name: rl_localhost-live

    • This is the volume group (VG) name to which this physical volume belongs. A volume group is a collection of physical volumes that create a pool of storage space from which logical volumes (LVs) are allocated.
  • PV Size: <31.00 GiB / not usable 3.00 MiB

    • The size of the physical volume is 31.00 GiB, but 3.00 MiB is not usable, likely due to overhead or alignment issues within the physical volume.
  • Allocatable: yes (but full)

    • This indicates whether the physical volume is available for allocation into logical volumes. It is set to “yes,” but the “full” remark means that all available physical extents (PEs) are already allocated.
  • PE Size: 4.00 MiB

    • This shows the size of a physical extent (PE), which is the smallest chunk of data that LVM manages. In this case, each PE is 4.00 MiB.
  • Total PE: 7935

    • The total number of physical extents on this physical volume. This is calculated based on the size of the physical volume and the size of each PE.
  • Free PE: 0

    • This shows the number of free physical extents on the physical volume. In this case, there are no free extents left, meaning all available space has been allocated.
  • Allocated PE: 7935

    • The number of allocated physical extents, which are being used by logical volumes. Since all available extents are allocated, this number matches the total number of PEs.
  • PV UUID: gMVNd5-peB1-uUX6-Rw28-4Ncb-mi1b-rDJR38

    • The unique identifier for this physical volume, which is used internally by LVM to track physical volumes across different systems or reboots.
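As a sanity check, the pvdisplay numbers can be recomputed from the fdisk sector count shown earlier; a quick sketch of the arithmetic in shell:

```shell
sectors=65009664                             # /dev/sda2 sector count, from fdisk -l
pv_mib=$(( sectors * 512 / 1024 / 1024 ))    # PV size in MiB -> 31743
total_pe=$(( pv_mib / 4 ))                   # 4 MiB extents   -> 7935 Total PE
not_usable=$(( pv_mib - total_pe * 4 ))      # remainder       -> the 3 MiB "not usable"
echo "$pv_mib $total_pe $not_usable"
```

So the "not usable 3.00 MiB" is just the tail of the partition that does not divide evenly into 4 MiB extents.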
vgdisplay: needs sudo

example output:

  --- Volume group ---
  VG Name               vg_data
  System ID             
  Format                lvm2
  VG Status             resizable
  MAX LV                0
  Cur LV                2
  Open LV               2
  Max PV                0
  Cur PV                2
  Act PV                2
  VG Size               99.99 GiB
  PE Size               4.00 MiB
  Total PE              25599
  Alloc PE / Size       12800 / 50.00 GiB
  Free  PE / Size       12799 / 49.99 GiB
  VG UUID               hFwi0D-GTlv-NFjp-O2he-x8Yw-kfIa-c3QqX6
lvdisplay: needs sudo

example output:

  --- Logical volume ---
LV Path                /dev/vg_data/lv_backup
LV Name                lv_backup
VG Name                vg_data
LV UUID                B1q8Iq-0tWz-Fk0P-xwQ7-0T5T-QC3c-XcK5T1
LV Write Access        read/write
LV Creation host, time hostname, 2024-10-08 12:00:00 +0000
LV Status              available
# open                 1
LV Size                50.00 GiB
Current LE             12800
Segments               1
Allocation             inherit
Read ahead sectors     auto
- currently set to     256
Block device           253:2

Creating & Carving LVM

A little struggle 😣

So I ran into several hurdles when putting this into practice with Proxmox.

๐Ÿ“ Since I was using Proxmox the storage was already an LVM on account of it being a virtualized disk Solution: I created a second storage volume in addition to sda, sdb was emulated as an SSD and RAW storage. Once this was set up I was able to initialize the process.

๐Ÿ“ The demonstration lab suggested using /dev/xvd$disk the ‘x’ relates to xcp-ng, a virtualization technology that differs from mine. I am using KVM and VirtIO on Proxmox so the naming convention would differ. so vdb would be a naming convention akin to this system.

Ultimately, once I created this RAW SSD emulated storage I was able to move on.

The steps

1. Provisioning the Physical Volume ✅

sudo pvcreate /dev/sdb

Output: Physical volume “/dev/sdb” successfully created.

Confirmed with: lsblk

2. Provisioning the Volume Group ✅

sudo vgcreate examplegroup /dev/sdb

Output: Volume group “examplegroup” successfully created

Confirmed with: vgdisplay

3. Provisioning the actual Logical Volume ✅

sudo lvcreate -L 1G -n lv1 examplegroup

  Logical volume "lv1" created.

Confirmed with: lvdisplay

4. Formatting a Logical Volume with a FS 💾 ✅

mkfs.ext4 /dev/mapper/examplegroup-lv1

  mke2fs 1.45.6 (20-Mar-2020)
  Creating filesystem with 262144 4k blocks and 65536 inodes
  Filesystem UUID: a1b2c3d4-e5f6-789a-bcde-123456789abc
  Superblock backups stored on blocks:
  32768, 98304

  Allocating group tables: done
  Writing inode tables: done
  Creating journal (8192 blocks): done
  Writing superblocks and filesystem accounting information: done

Checking remaining VG Space

Broadly:

The vgs command provides a summary of all Volume Groups, including information about free space.

Looks like this:

  VG        #PV #LV #SN Attr   VSize   VFree  
  vg_data     2   3   0 wz--n- 199.99g  49.99g

More Specifically:

The vgdisplay command will show this info as it pertains to a particular VG 👍

Ultimately

Thanks to putting our heads together in a study group, I was able to get through the complete process several times. I ended up carving a Volume Group into Logical Volumes, some of which were formatted as ext4 and some as XFS using mkfs. 😁

The process was a little long, with many tangential learnings; ultimately I think this experience has deeply ingrained the process in my psyche, pushing out childhood memories of finger painting.

Fstab

I would like to return to fstab and learn more about it, as I ran into some issues mounting LVs via fstab.
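For my future self, a sketch of how I believe the mount should be persisted, using the hypothetical examplegroup/lv1 volume and the /space mount point (my assumptions, not a tested config):

```shell
# Find the filesystem UUID of the logical volume
blkid /dev/mapper/examplegroup-lv1

# /etc/fstab entry (mounting by UUID survives device renames;
# "nofail" keeps a missing LV from blocking boot):
#   UUID=<uuid-from-blkid>  /space  ext4  defaults,nofail  0 2

# Or by the stable device-mapper path:
#   /dev/mapper/examplegroup-lv1  /space  ext4  defaults,nofail  0 2

mount -a   # re-read fstab now; errors surface here instead of at boot
```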

LVS

lvs (Logical Volume Summary) provides, as the name implies, a summary of all logical volumes present on the system. I wish I had known this command earlier, as I was using the other commands listed above 😄

Example output:

  LV       VG        Attr       LSize   Pool Origin Data%  Meta%  Move Log Cpy%Sync Convert
  lv_backup vg_data  -wi-a----- 50.00g                                                    
  lv_home   vg_data  -wi-ao---- 20.00g                                                    
  lv_root   vg_system -wi-ao---- 30.00g

RAID 💾💾💾

RAID stands for Redundant Array of Independent Disks, more on that below.

Raw notes:

Created a Proxmox VM with Rocky 9.4 minimal and 3 x raw SSD-emulated disks

MDADM
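My raw notes stop here, so as a placeholder for next session: a sketch of how I understand the three-disk array would be created with mdadm (device names assumed from my Proxmox setup; untested):

```shell
# Create a RAID 5 array from the three emulated disks
sudo mdadm --create /dev/md0 --level=5 --raid-devices=3 /dev/sdb /dev/sdc /dev/sdd

cat /proc/mdstat              # watch the initial sync progress
sudo mdadm --detail /dev/md0  # full status of the array

# Persist the array definition so it assembles at boot
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm.conf
```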


Lecture thoughts 🪩

Everything ‘is’ a file after all

Always has been a file

I attended the lecture this past Saturday. During that lecture I had a major awakening in regard to how Unix works. I have heard the ‘everything is a file’ mantra on several occasions. However, sometimes hearing is not fully understanding. My moment of understanding came when we checked running processes and opened one as a file. I knew that everything under the hood was a file, but ephemeral things like processes were not part of the picture. That truly blew my mind 🤯

UUIDs go beyond block storage devices

It turns out that logical volumes are assigned UUIDs. UUIDs are unique strings that can be used to reliably reference or link to a storage device. So logical volumes get the same respect as physical volumes.

If I were to run

  • blkid

or

  • lvdisplay

one would see a string like this:

f4d12857-93c3-4d6f-91e5-bb379f02e1d1
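As a quick check of the format: on Linux the kernel will hand out a fresh UUID on demand, and the string form is always 36 characters (five hex groups of 8-4-4-4-12):

```shell
u=$(cat /proc/sys/kernel/random/uuid)  # e.g. f4d12857-93c3-4d6f-91e5-bb379f02e1d1
echo "$u"
echo "${#u}"   # 36 characters: 32 hex digits + 4 dashes
```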

More on this later, I went down a rabbit hole. 🐇


Logical Volumes 🧠💼

Logical Volumes (LVs) offer a flexible way to manage disk storage in Linux. With LVs, users can create, resize, and move storage volumes without being limited by the physical layout of the disks. The Logical Volume Manager (LVM) abstracts the underlying physical storage, making it easier to manage disk space, support snapshots, and resize volumes as needed.

The Layered Structure 🏰

The structure of LVM involves multiple layers:

  • Physical Volumes (PV): These are the actual physical storage devices, like SSDs or HDDs (or partitions on them). One or more Physical Volumes are grouped together to form a Volume Group.
  • Volume Groups (VG): VGs are containers that hold one or more Logical Volumes. A Volume Group can span across multiple Physical Volumes, providing a flexible pool of storage.
  • Logical Volumes (LV): LVs are sections of a Volume Group that serve as the actual storage units. A Volume Group can contain many Logical Volumes of different sizes.
  • File Systems: Finally, file systems are placed on the Logical Volumes to store data.

The table below illustrates how a Volume Group (VG) can host several Logical Volumes (LVs) of varying sizes. Each Logical Volume is assigned a unique identifier (UUID), and snapshots are also given their own UUIDs.

VG          LV          FS    Size     UUID Short       Snapshot      Snapshot UUID Short
vg_data     lv_root     XFS   50 GB    f4d1-91e5-bb37   None          N/A
vg_data     lv_home     XFS   100 GB   a123-c567-8901   snap_home     abcd-ef56-7890
vg_data     lv_var      ext4  20 GB    c9d8-89e0-f9a1   None          N/A
vg_data     lv_backup   ext4  150 GB   f129-bc97-f134   snap_backup   f7d8-a67e-f765
vg_data     lv_logs     XFS   10 GB    e123-9abc-e1d4   None          N/A
vg_storage  lv_media    ext4  200 GB   bc97-bc9e-5612   snap_media    d98f-9e67-1cd2

I understand that this course focuses on the two most popular and reliable file systems, as these are the ones most commonly encountered in enterprise environments. However, I’ve noticed that Btrfs is starting to gain traction. I’ve listened to several talks and read extensively about its features and potential.

Some limitations apply to the status quo

Both ext4 and XFS volumes can handle a lot of scenarios, including online resizing, which is when the size of an LV must be increased while actively in use (online). However, for both of these filesystems, online resizing can only grow an LV, not shrink it (ext4 can only be shrunk offline, and XFS cannot be shrunk at all). Additionally, snapshots must be managed on a per-LV basis, though that is not a huge issue with the advent of automation tools.
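In practice, growing one of these LVs online looks something like this (a sketch with hypothetical names; the -r flag asks lvextend to grow the filesystem along with the LV):

```shell
# Grow the LV by 1 GiB and resize the ext4/XFS filesystem in one step
sudo lvextend -r -L +1G /dev/vg_data/lv_home

# The reverse (lvreduce) is only safe offline, and only for ext4; XFS cannot shrink
```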

RAID

Intro

RAID stands for Redundant Array of Independent Disks (originally Redundant Array of Inexpensive Disks). It is a data storage technology that combines multiple physical disk drives into a single logical unit to improve performance, increase capacity, or provide redundancy to protect data against drive failures.

How it differs from LVM

  • RAID is primarily for redundancy and performance.
  • RAID works at a block level across disks.
  • It is sometimes also used to create a single high-capacity pool.

Multiple configurations

  • There are configurations for differing degrees of performance
  • There are configurations for different levels of redundancy

Comparison Table

| RAID Level | Description | Performance | Redundancy |
| --- | --- | --- | --- |
| RAID 0 | Data striping without parity or mirroring. | High (fast read/write speeds) | None (no redundancy) |
| RAID 1 | Mirroring of data across two or more disks. | Moderate (read speed improves, write speed similar to single disk) | High (data is mirrored) |
| RAID 5 | Block-level striping with distributed parity. | Moderate (improved read speeds, slower writes due to parity calculation) | Moderate (can tolerate 1 disk failure) |
| RAID 6 | Block-level striping with dual parity. | Moderate (similar to RAID 5, but slower writes) | High (can tolerate 2 disk failures) |
| RAID 10 | Striping across mirrored sets (combines RAID 1 and RAID 0). | High (fast read/write due to striping and redundancy) | High (can tolerate multiple disk failures depending on which disks fail) |
| RAID 50 | RAID 5 arrays striped (combines RAID 5 and RAID 0). | High (fast read speeds, but slower writes) | Moderate (can tolerate up to 1 disk failure in each RAID 5 array) |
| RAID 60 | RAID 6 arrays striped (combines RAID 6 and RAID 0). | High (fast reads, slower writes due to dual parity) | Very High (can tolerate up to 2 disk failures in each RAID 6 array) |
| RAID 1+0 | Nested RAID 1 over RAID 0 (mirroring across striped sets). | High (similar to RAID 10) | High (similar to RAID 10) |
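The redundancy column translates directly into usable capacity. A small throwaway helper (my own sketch, not an mdadm feature) makes the trade-off concrete for n equal disks:

```shell
# Usable capacity (same units as disk size) for n equal disks
raid_usable() {  # usage: raid_usable LEVEL N SIZE
  case $1 in
    0)  echo $(( $2 * $3 )) ;;         # striping only: every disk counts
    1)  echo "$3" ;;                   # full mirror: one disk's worth
    5)  echo $(( ($2 - 1) * $3 )) ;;   # one disk lost to parity
    6)  echo $(( ($2 - 2) * $3 )) ;;   # two disks lost to parity
    10) echo $(( $2 * $3 / 2 )) ;;     # mirrored stripes: half the total
  esac
}

raid_usable 5 3 100    # three 100 GB disks in RAID 5 -> 200
raid_usable 6 4 100    # four  100 GB disks in RAID 6 -> 200
```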

Software & Hardware RAID

  • Software RAID is when the host Operating System maintains the linking and syncing.
  • Hardware RAID is when a dedicated device does this, for example a PCI RAID controller card.

Software RAID:

runs as a general task on the CPU. With multi-core/multi-thread processors and more robust process handling, software RAID is more reliable than it was in the past. Advantages:

  • Portable 🦶
  • Flexible 🧘
  • Cost Effective 💰

Hardware RAID:

Application-specific hardware like a RAID controller always has the potential to be more performant and reliable. But this is manufacturer-dependent, and the good ones tend to be expensive (such is life) 💸

  • Performant* 🏎️
  • Reliable* ⏳

Other Filesystem research/writing

Btrfs

I wrote an entire separate article about the B-Tree FileSystem (Btrfs), as I find it interesting: Link to Btrfs article

Linux Filesystem Comparison

I find filesystems really interesting. Recently I made a huge note comparing the major filesystems: Link to FileSystems note/article

Linux Filesystem Timeline

This timeline may help illuminate how filesystems incrementally improve over time.

Year   Milestone
1993   ext2 filesystem development begins.
1994   ext2 released with Linux Kernel 1.0.
1994   XFS first developed by SGI for IRIX.
2001   ext3 released, introducing journaling.
2001   XFS ported to Linux.
2006   Development of ext4 begins to extend the capabilities of ext3.
2007   Btrfs development announced by Oracle.
2008   ext4 marked as stable in Linux Kernel 2.6.28.
2009   Btrfs included in Linux Kernel 2.6.29 as an experimental filesystem.
2009   XFS officially included in Linux Kernel 2.6.36.
2010   ext4 becomes the default filesystem for many Linux distributions, including Ubuntu and Fedora.
2012   Btrfs adopted by SUSE Linux Enterprise Server.
2014   XFS becomes the default filesystem in RHEL 7.
2014   Fedora starts offering Btrfs as an option.
2020   Fedora 33 makes Btrfs the default filesystem.

Crisis Management 🔥🧯

Intro

I read a chapter from Google's security handbook focused on responding to a security incident. The chapter covers a wide range of information, including common mistakes, templates, etc. Scott has asked us to pull out keywords that may help us better triage an incident. I decided to extend this by creating an incident response checklist for future reference. The order of this list differs slightly from the article's, but I think it summarizes the chapter well.

Overview 🔍

Effective crisis management requires taking command and maintaining control of an incident. The outcome of a security incident largely depends on how well your organization prepares and responds, a capability referred to as incident response (IR).

Transparency 🔮

Transparency is key in managing incidents, particularly in light of regulations such as GDPR and service contracts. Customers are continually pushing the boundaries for how quickly investigations must begin, progress, and be resolved. Organizations are often expected to respond to potential security problems within 24 hours or less.

As the saying goes: “There are only two types of companies: those that know they’ve been compromised, and those that don’t know.”

Steps of Crisis Management 🪜

1. Triage 🥼

  • The First Step: Don't Panic! Not every incident is a crisis.
  • Differentiate between compromises and bugs.
  • Make educated and informed assumptions about the severity and potential consequences of the incident.
    • What data might be accessible to someone on the compromised system?
    • What trust relationships does the potentially compromised system have with other systems?
    • Are there compensating controls that an attacker would also have to penetrate?
    • Does the attack seem to be commodity opportunistic malware?

2. Manage 🧠

  • Establish Your Incident Team:

    • Management Liaison: Coordinate between the technical team and upper management.
    • Incident Commander (IC): Ultimately responsible for ensuring that rules around confidentiality are set, communicated, and followed.
    • Operations Coordinator (OC): Coordinate the technical side of the incident response.
    • Legal Lead: Ensure the response complies with legal obligations.
    • Communications Lead: Ensure internal and external communications are clear and effective.

    Guidelines for Management:

    • Maintain a clear line of command.
    • Designate clearly defined roles.
    • Keep a working record of debugging and mitigation steps as you go.
    • Declare incidents early and often.

3. Declare 📣

  • Declare incidents as soon as they are recognized to ensure early containment and response.

4. Communicate ☎️

  • Avoid Misunderstandings: Ensure that all parties are aligned on the severity and impact of the incident.
  • Avoid Hedging: Be clear and concise in communication, avoiding ambiguity.
  • Meetings: Hold regular update meetings to ensure that the team and stakeholders are aware of the incident’s progress and next steps.

5. Operational Security (OpSec) 🥷

  • OpSec refers to the practice of keeping your response activities confidential.
    • Use secure lines of communication.
    • Avoid interacting directly with affected networks or components.
    • Lock down affected accounts immediately.
    • Shut down compromised systems to prevent further damage.
    • Use different credentials when interacting with compromised systems.
    • Ensure the right people are informed with the correct level of detail about the incident.

6. Investigate 🕵

  • Follow the OODA Loop: Observe, Orient, Decide, and Act.
    • Forensic Imaging: Capture system images for later analysis.
    • Memory Imaging: Capture the contents of system memory.
    • File Carving: Extract useful files from data storage.
    • Log Analysis: Analyze system logs to identify suspicious activity.
    • Malware Analysis: Dissect malware to understand its function.
    • Sharding the Investigation: Divide the investigation into manageable parts.
    • Parallelize the Incident: Have different teams working on different aspects of the investigation simultaneously.

7. Handover ⛓️

  • Properly hand over any remaining tasks or investigations to ensure nothing is overlooked as the incident winds down.

8. Remediate 🛠️

  • Fix vulnerabilities and mitigate damage to prevent future incidents. Implement long-term security measures.

9. Close 📒

  • Close the incident formally. Review what went well, what could have been better, and document the lessons learned for future incidents.

High Availability ⚓️

High availability refers to the design and implementation of systems, services, or applications to minimize downtime and ensure continuous operation. The goal of high availability is to ensure that a system is accessible and operational for the maximum possible amount of time.


Key Terms in HA 🔑

1. Uptime & Downtime ↕️

  • Uptime: The period during which a system is fully operational and accessible to users.
  • Downtime: The period during which the system is unavailable or not functioning as expected. HA systems aim to minimize downtime as much as possible.

2. Redundancy 👯

Redundancy refers to duplicating critical components or functions of a system to increase reliability. In HA systems, components like servers, databases, or networks are replicated so that if one fails, another can take over seamlessly.

  • Active-Active Redundancy: Multiple systems work simultaneously, and if one fails, the others continue without interruption.
  • Active-Passive Redundancy: A primary system works actively, while a backup system remains idle until a failure occurs.

3. Failover 🛤️🛤️🛤️

Failover is the process of switching to a standby or backup system when the primary system fails. In HA setups, failover is often automatic to minimize disruption. Failback refers to switching back to the primary system after recovery.

4. Load Balancing ⚖️

Load balancing distributes network or application traffic across multiple servers to ensure that no single server becomes overwhelmed. It enhances both performance and availability by balancing the load and rerouting traffic in case of server failures.

5. Elastic Scaling 🎋

Elastic scaling is the ability to automatically adjust resource capacity (compute, memory, etc.) based on workload demand. This is crucial for HA, as it prevents resource exhaustion during peak loads and reduces costs during low-demand periods.

  • Horizontal Scaling (Scaling Out): Adding more instances/servers to distribute load.
  • Vertical Scaling (Scaling Up): Increasing the resources of a single instance.
  • Auto-scaling: Automatic scaling based on real-time metrics.

In Kubernetes, elastic scaling is managed through Horizontal Pod Autoscalers (HPA), which automatically scale the number of pods in a deployment based on observed CPU utilization or other metrics. In HA systems using Kubernetes, autoscaling ensures that the right amount of resources are always allocated based on demand, contributing to both high performance and availability.

6. Fault Tolerance 🤦‍♂️😅

Fault tolerance refers to the system’s ability to continue operating correctly even when one or more components fail. Fault-tolerant systems detect, isolate, and handle faults without causing downtime.

7. Cluster 🧑‍🍳🧑‍🍳🧑‍🍳🧑‍🍳

A cluster is a group of servers or nodes working together as a single system to provide HA. Clustering ensures that if one node in the cluster fails, another node takes over its tasks, maintaining service availability.

In Kubernetes, a cluster consists of a set of worker machines, called nodes, that run containerized applications. Kubernetes ensures high availability by distributing workloads across multiple nodes and automatically replacing failed nodes or restarting containers.

In Warewulf, a cluster provisioning tool, high availability is addressed by enabling systems to quickly re-deploy compute nodes. Warewulf helps manage HA in high-performance computing (HPC) environments by ensuring compute nodes are readily available for workloads in case of node failure or maintenance.

8. Replication 💾💾💾

Replication is the process of duplicating data across multiple storage systems or servers. In HA, replication ensures that a copy of data exists on multiple systems, so if one system fails, another can continue providing access to the same data.

9. Disaster Recovery (DR) 📉📈

Disaster recovery involves strategies to restore a system after a catastrophic failure (e.g., data center failure). DR usually includes off-site backups and failover to remote data centers.

10. Mean Time Between Failures (MTBF) ⏱️

MTBF measures the average time between system failures. A higher MTBF indicates a more reliable system.

11. Mean Time to Repair (MTTR) ⏱️🛠️

MTTR measures how long it takes to restore a system to full functionality after a failure. Minimizing MTTR is critical for reducing downtime in HA systems.
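These two metrics combine into the classic availability estimate, Availability = MTBF / (MTBF + MTTR). For example:

```shell
# 9999 hours between failures, 1 hour to repair -> "four nines"
awk 'BEGIN { mtbf = 9999; mttr = 1; printf "%.2f%%\n", 100 * mtbf / (mtbf + mttr) }'
```

Driving MTBF up or MTTR down both push availability toward 100%, which is why HA work attacks both numbers.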

12. Quorum 🐾

Quorum is the minimum number of nodes or components in a distributed system that must agree or function correctly to maintain availability. Quorum is often required in cluster setups to ensure consistent operation.
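For a cluster of n nodes, quorum is typically a strict majority, floor(n/2) + 1, which is why odd node counts are favored:

```shell
quorum() { echo $(( $1 / 2 + 1 )); }   # strict majority of n nodes

quorum 3   # -> 2 (a 3-node cluster survives 1 node loss)
quorum 5   # -> 3 (a 5-node cluster survives 2 node losses)
quorum 4   # -> 3 (4 nodes still only survive 1 loss, same as 3 nodes)
```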

13. Geographic Redundancy 🌎

Geographic redundancy involves deploying systems across multiple geographical locations or data centers. This ensures that services remain available even if a region experiences a failure (e.g., due to natural disasters or power outages).

Relation to Kubernetes and Warewulf

I wanted to look at Kubernetes and Warewulf and how they relate to this topic, as they are both systems written in Go and both well-liked systems relating to HA.

I will write a future article all about Go and why it is fantastic, so for now I will just leave this point:

  • A language built by the progenitors of UNIX and C for diverse, connected systems.

  • Kubernetes 🍀: automated failover, container orchestration, elastic scaling with Horizontal Pod Autoscalers, and geographic redundancy through multi-region clusters.

  • Warewulf 🐺: In HPC environments, Warewulf aids in HA by managing node provisioning and monitoring the health of the compute nodes. In case of failures, Warewulf can quickly re-deploy nodes, ensuring that the overall HPC workload is minimally disrupted.

Kubernetes and Warewulf both play key roles in maintaining high availability in modern infrastructures, with Kubernetes focusing on containerized applications and Warewulf on HPC cluster management.

HA & Incident Triage 🚨

High Availability (HA) systems improve operational security, allowing precise triage due to these factors:

1. Continuous Monitoring

HA systems continuously monitor performance to detect security threats in real time.

2. Continuous Comprehensive Logging

Logs are centralized, providing full visibility across the system for quick forensic analysis.

3. Declarative Structure

Declarative configurations enable automated remediation scripts for rapid self-healing.

4. Automated Alerting

Automated alerts prioritize and notify teams of security incidents as they happen.

Beyond triage, HA systems offer several other security benefits:

5. Containerized Components

Microservices are isolated, allowing affected components to be restarted without system-wide impact.

6. Elastic Scaling

HA systems dynamically scale resources to handle traffic spikes or increased workloads securely.

7. Automated Failover

Automated failover isolates compromised components, ensuring continuous uptime during incidents.

8. Data Replication by Design

Multiple data copies prevent loss and aid in disaster recovery during security breaches.

9. Event Replay

Event replay allows security teams to analyze incidents for better future defense.


Lessons Learned about HA

What I have learned from reading articles, searching and compiling these notes.

The Why

In today's world, data is ingested, processed, and distributed at an unprecedented rate. To keep up, we must implement systems that operate with the same speed and frequency. The era of handling sequential tasks one at a time is over. High availability solutions such as Kubernetes, declarative infrastructure, cluster management, and stateless automation are now essential in modern IT environments.

Although some developers, administrators, and engineers express concerns about the complexity of these deployments, preferring simpler solutions, the reality is that we need to adapt to the increasing demands of data availability and security. Setting up a robust infrastructure now may have a higher initial labor cost, but it will undoubtedly reduce the risk of costly security incidents and system downtime in the future.

The What

The systems we must focus on include:

  • Kubernetes for orchestrating containerized applications across clusters
  • Declarative Infrastructure using Infrastructure-as-Code (IaC) for consistent and scalable deployments
  • Cluster Management to efficiently manage and scale distributed systems
  • Stateless Automation to ensure systems can self-heal and adapt quickly without human intervention

These technologies form the backbone of high availability infrastructures, designed to handle vast amounts of data without compromising performance or uptime.

The How

To implement these solutions, organizations should:

  • Automate deployment processes using tools like Terraform or Ansible to ensure repeatability and reliability.
  • Leverage Kubernetes clusters to manage microservices architectures, enabling fast scaling and robust fault tolerance.
  • Adopt a declarative approach by defining infrastructure states in code, allowing easier management and version control of systems.
  • Utilize stateless automation to reduce system reliance on individual components, making the system more resilient to failure and able to recover without manual intervention.

By adopting these practices, companies can build resilient, secure, and scalable infrastructures that meet the demands of today's fast-paced data environments.

HA and SIR Synergy

The Why

As organizations become increasingly reliant on digital systems, security incidents are inevitable. Data breaches, malware attacks, and system vulnerabilities pose significant risks that can lead to costly downtime, reputational damage, and legal consequences. In this high-availability (HA) era, where uninterrupted service is critical, an effective Security Incident Response (SIR) plan is more important than ever.

The What

At its core, Security Incident Response (SIR) involves:

  • Identifying and responding to potential security incidents in real time
  • Containing incidents to prevent further damage and preserve critical systems
  • Eradicating malicious activity from systems quickly and efficiently
  • Recovering compromised systems and restoring operations to normal
  • Reviewing and improving security protocols based on lessons learned from incidents

In high-availability environments, this process works hand in hand with:

  • Failover mechanisms that redirect traffic or services away from compromised systems to maintain uptime
  • Resilient architectures where services are spread across clusters, reducing the impact of a security breach on overall operations
  • Automated remediation that helps detect and respond to threats before they cause major disruptions

Together, SIR and HA form a dual-layer defense to keep systems running while actively dealing with security threats.

The How

For Security Incident Response to effectively complement high availability, organizations should:

  1. Implement proactive monitoring and detection: Use tools like SIEM (Security Information and Event Management) platforms to spot anomalies and potential intrusions early.

  2. Prepare for containment and isolation: High-availability systems should be designed with segmentation in mind, so a compromised node or service can be isolated without taking down the whole system.

  3. Automate incident response and failover: Automate responses to specific types of threats, such as deploying firewalls, initiating backups, or failing over to standby systems.

  4. Maintain constant communication: SIR teams should work closely with HA engineers to ensure that any security measures (e.g., shutting down affected systems) do not inadvertently cause a service disruption.

  5. Review and test regularly: Just as HA systems undergo regular testing for failovers, security incident response plans should be tested in simulated scenarios.
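Step 3 above can be sketched as a tiny shell check. This is a minimal illustration, not a real incident-response tool; the error counts and threshold are hypothetical placeholders:

```shell
#!/bin/sh
# Minimal sketch of automated response: if the observed error rate
# exceeds a budget threshold, trigger a (pretend) failover action.
errors=42; total=1000; threshold=2   # threshold in percent; all values hypothetical
rate=$(awk -v e="$errors" -v t="$total" 'BEGIN { printf "%.1f", e/t*100 }')
if awk -v r="$rate" -v th="$threshold" 'BEGIN { exit !(r > th) }'; then
  echo "error rate ${rate}% exceeds ${threshold}% -- failing over to standby"
else
  echo "error rate ${rate}% within budget"
fi
```

In a real environment the `echo` would be replaced by whatever failover mechanism the platform provides (VIP move, load balancer drain, standby promotion).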


Why study failure?

  • Proactive Prevention ๐Ÿ› ๏ธ: Identifying potential failure points allows you to take preventive measures and avoid system outages.

  • Faster Recovery ๐Ÿš‘: Knowing how systems can fail enables quicker response and minimizes downtime with a tested failover plan.

  • Resilience Building ๐Ÿ›ก๏ธ: Understanding failure mechanisms helps design architectures that automatically handle issues, ensuring better uptime.

  • Risk Management โš–๏ธ: Acknowledging failure risks helps balance uptime goals with realistic risk management and avoids catastrophic downtime.


Service Level Objectives (SLOs) ๐ŸŽฏ

Intro

Service Level Objectives (SLOs) are specific, measurable targets ๐Ÿ“Š that define the expected performance or quality level of a service, typically included in Service Level Agreements (SLAs). They help ensure services meet customer expectations by setting clear benchmarks for key metrics like reliability, availability, and response time.

SLOs help teams prioritize work, manage resources, and drive decisions based on real performance data. ๐Ÿ”ง

Check out Google’s SRE guide to SLOs for a deeper treatment.

Bad Operations (BadOps) โš ๏ธ

  • The unspoken rule of 100% uptime is a myth โŒ.
  • 100% uptime โ‰  100% reliability.
  • Each extra โ€œnineโ€ of uptime comes at a significant cost ๐Ÿ’ธ.
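The cost of each extra nine becomes concrete once you convert an availability target into allowed downtime. A quick back-of-the-envelope sketch (pure arithmetic, assuming a 365.25-day year):

```shell
# Allowed downtime per year for common availability targets.
# 525960 = 365.25 days * 24 hours * 60 minutes.
for a in 99 99.9 99.99 99.999; do
  awk -v a="$a" 'BEGIN { printf "%7.3f%% -> %8.1f min/year\n", a, (100 - a)/100 * 525960 }'
done
```

Two nines allows over 5,000 minutes of downtime a year; five nines allows roughly five, which is why each additional nine costs so much.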

Why Have SLOs? ๐Ÿค”

  • Engineers are scarce ๐Ÿง‘โ€๐Ÿ’ป: Impossible standards require more resources.
  • They inform decisions: with a clear objective, the opportunity cost of reliability becomes visible.
  • SREs aren’t just automation experts; they’re driven by SLOs ๐Ÿš€.
  • SLOs help prioritize tasks effectively ๐Ÿ“‹.
  • They define error budgets to manage acceptable failure rates.
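An error budget is just the inverse of the SLO over a measurement window. A minimal sketch with hypothetical numbers (a 99.9% availability SLO over a 30-day window):

```shell
# Error budget = (100% - SLO) applied to the measurement window.
awk 'BEGIN {
  slo = 99.9                    # availability target, percent (hypothetical)
  period_min = 30 * 24 * 60     # 30-day window in minutes
  budget = (100 - slo)/100 * period_min
  printf "error budget: %.1f minutes of downtime per 30 days\n", budget
}'
```

Once the budget is spent, the team shifts effort from shipping features to restoring reliability; while budget remains, some failure is acceptable.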

Service Level Indicators (SLIs) ๐Ÿ“

Intro

A Service Level Indicator (SLI) is a specific, quantifiable metric ๐Ÿ”ข used to measure the performance of a service, often forming the foundation of SLOs.

Setting a Solid SLI โš–๏ธ

Each of these pairs two counts — good events over total events — to form a ratio that can serve as a target SLI:

  • Number of successful HTTP requests / total HTTP requests (success rate) ๐ŸŒ.
  • Number of gRPC calls completed in under 100ms / total gRPC requests โฑ๏ธ.
  • Number of search results using the entire data set / total search results, including gracefully degraded ones ๐Ÿ”.
  • Number of stock check requests with data fresher than 10 minutes / total stock check requests ๐Ÿ›’.
  • Number of โ€œgood user minutesโ€ based on defined criteria / total user minutes ๐Ÿ“….
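Every ratio above reduces to good events divided by total events. A minimal shell sketch with hypothetical request counts:

```shell
# SLI as a ratio of good events to total events (counts are hypothetical;
# in practice they would come from logs or a metrics system).
good=985; total=1000
awk -v g="$good" -v t="$total" 'BEGIN { printf "SLI (success rate): %.2f%%\n", g/t*100 }'
```

Against a 99% SLO this 98.50% measurement would mean the service is out of budget; against a 98% SLO it would be healthy.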

Glossary of Terms ๐Ÿ™Š

These are additional terms I have become familiar with; they were not covered in this unit's notes but were relayed by Scott.

Five 9s

Refers to 99.999% availability, which allows roughly 5.26 minutes of downtime per year.

Single point of failure

A component in a system whose failure will cause the entire system to fail.

Key Performance Indicators

Metrics used to measure the performance and effectiveness of a system or service.

SLI (Service Level Indicator)

A specific, quantifiable metric used to measure the performance of a service (e.g., latency, uptime).

SLO (Service Level Objective)

A target or goal for an SLI, defining acceptable performance for a service.

SLA (Service Level Agreement)

A formal contract that defines the level of service expected between a provider and a customer, including penalties for not meeting SLOs.

Active-Standby

A redundancy setup where one component is active while another remains on standby to take over if the active one fails.

Active-Active

A redundancy setup where multiple components work simultaneously to share the load, and failure in one component is covered by the others.

MTTD (Mean Time to Detect)

The average time it takes to detect a failure or incident in a system.

MTTR (Mean Time to Repair)

The average time it takes to recover from a failure or incident once it has been detected.

MTBF (Mean Time Between Failures) ‘Mentioned Once’

The average time a system operates without failure, used to measure system reliability.
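MTBF and MTTR combine into the classic steady-state availability formula: availability = MTBF / (MTBF + MTTR). A quick sketch with hypothetical values:

```shell
# Steady-state availability from MTBF and MTTR:
#   availability = MTBF / (MTBF + MTTR)
mtbf=1000; mttr=2    # hours; hypothetical values
awk -v b="$mtbf" -v r="$mttr" 'BEGIN { printf "availability: %.3f%%\n", b/(b+r)*100 }'
```

Note that shrinking MTTR (detect and repair faster) improves availability just as effectively as stretching MTBF (fail less often), which is why MTTD and MTTR get so much attention in HA environments.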


Reflection

  • Remaining questions?
  • How will I put this into practice?

Discord: https://discord.com/invite/m6VPPD9usw
Youtube: https://www.youtube.com/@het_tanis8213
Twitch: https://www.twitch.tv/het_tanis
ProLUG Book: https://leanpub.com/theprolugbigbookoflabs
KillerCoda: https://killercoda.com/het-tanis