Tech Blog

ProLUG SEC Unit 1 🔒

Intro 👋

I’ve just started a new Security Engineering course created by Scott Champine through ProLUG. As a graduate of his Linux Administration course and an active contributor to the Professional Linux User Group, I felt compelled to make time for this new course—I’ve learned a great deal from his teachings in the past.


Worksheet

Discussion Post 1

Question

  • What is Security?

Answer

In the context of cybersecurity, security means integrating protective measures throughout the system lifecycle so the system maintains its mission and operational effectiveness, even in the presence of adversarial threats.

Question

  • Describe the CIA Triad.

Answer

The CIA Triad is a core model in systems security engineering.

  1. Confidentiality – Preventing unauthorized disclosure of system data or resources, often enforced through access control, encryption, and information flow policies.

  2. Integrity – Ensuring that system data and operations are not altered in an unauthorized or undetected way, including protection against both accidental and intentional modification.

  3. Availability – Ensuring reliable access to system services and resources when required, even under attack or component failure.

Question

  • What is the relationship between Authority, Will, and Force as they relate to security?

Answer

In systems security engineering:

  • Authority is derived from policy and design requirements—what the system must enforce according to mission objectives, laws, or standards.
  • Will represents the commitment of system stakeholders to implement and maintain security measures.
  • Force is the application of engineered mechanisms—technical, administrative, or procedural—that ensure security objectives are realized in practice.

Question

  • What are the types of controls and how do they relate to the above question?

Answer

In systems security engineering, controls are safeguards built into the system to achieve security objectives. They align with Authority, Will, and Force as follows:
  1. Administrative Controls – Derived from organizational policy (Authority) and guide design, personnel roles, and security governance.
  2. Technical Controls – Engineered into the system as part of architecture and software/hardware features (Force), e.g., encryption, access enforcement, secure boot.
  3. Operational Controls – Rely on human procedures and configurations to maintain secure operations (Will and Force), such as patch management and monitoring.
  4. Physical Controls – Provide physical protection to system components (Force), e.g., secure facilities or tamper-evident hardware.

Discussion Post 2

Intro to the scenario[^3]

Find a STIG or compliance requirement that you do not agree is necessary for a server or service build.

Question

What is the STIG or compliance requirement trying to do?

Answer

The compliance requirement encourages users to set up automated CVE patch updates from trusted providers within a 24-hour timeframe.

Question

What category and type of control is it?

Answer
This STIG is an administrative control. Since it is not built into the system by default, it must be applied and managed manually.

Question

Defend why you think it is not necessary. (What type of defenses do you think you could present?)

Answer

Initially, I found it difficult to identify a STIG requirement that I disagreed with. However, after extensive review, I selected this one. I believe automated patching is not ideal, especially for production systems. Patches can introduce unexpected behaviors in dependent systems. Additionally, relying on automation can foster complacency or a lack of awareness over time.

STIG


Apache Server 2.4 UNIX Server Security Technical Implementation Guide :: Version 3, Release: 2 Benchmark Date: 30 Jan 2025
Vul ID: V-214270           Rule ID: SV-214270r961683_rule           STIG ID: AS24-U1-000930       
Severity: CAT II           Classification: Unclass           Legacy IDs: V-92749; SV-102837

Group Title: SRG-APP-000456-WSR-000187

Rule Title: The Apache web server must install security-relevant software updates within the configured time period directed by an authoritative source (e.g., IAVM, CTOs, DTMs, and STIGs).

Discussion: Security flaws with software applications are discovered daily. Vendors are constantly updating and patching their products to address newly discovered security vulnerabilities. Organizations (including any contractor to the organization) are required to promptly install security-relevant software updates (e.g., patches, service packs, and hot fixes). Flaws discovered during security assessments, continuous monitoring, incident response activities, or information system error handling must also be addressed expeditiously.

The Apache web server will be configured to check for and install security-relevant software updates from an authoritative source within an identified time period from the availability of the update. By default, this time period will be every 24 hours.

Check Text: Determine the most recent patch level of the Apache Web Server 2.4 software, as posted on the Apache HTTP Server Project website. If the Apache installation is a proprietary installation supporting an application and is supported by a vendor, determine the most recent patch level of the vendor’s installation.

In a command line, type "httpd -v".

If the version is more than one version behind the most recent patch level, this is a finding.

Definitions

  • CIA Triad: Core principles of security — Confidentiality, Integrity, and Availability — guiding protection of information systems.
  • Regulatory Compliance: Adhering to laws and regulations governing data privacy, security, and operational practices.
  • HIPAA: U.S. healthcare regulation enforcing privacy and security of patient health information.
  • Industry Standards: Best practices and technical guidelines agreed upon within specific industries to ensure consistency and security.
  • PCI DSS: Payment Card Industry Data Security Standard for protecting cardholder data during processing, storage, and transmission.
  • Security Frameworks: Structured guidelines for managing cybersecurity risks and controls within organizations.
  • CIS: Center for Internet Security, which provides globally recognized benchmarks and security controls.
  • STIG: Security Technical Implementation Guide — DoD configuration standards for securing IT systems and software.

Lab 🧪🥼

MariaDB STIG Remediation Lab – Q&A Format

Signing into Remote Host

Question

  • How do you connect to the remote host?

Answer

  • ssh mchammer@prolug.asuscomm.com with password SecLab12#$5

Initial Setup

Question

  • What tools should be installed?

Answer

  • tmux and vim via dnf install tmux vim -y

Question

  • Why use tmux?

Answer

  • To keep sessions alive across disconnects, serving the same purpose as running commands with nohup.

Installing and Verifying MariaDB

Question

  • How is MariaDB installed and started?

Answer

  • dnf install mariadb-server, then start and verify with:
    systemctl start mariadb
    systemctl status mariadb
    ss -ntulp | grep 3306

Question

  • How do you enter the MariaDB shell?

Answer

  • Run mariadb.

V-253666: Listing Users & Max Connections

Question

  • How do you view users and max connections?

Answer

  • Run:
    SELECT DISTINCT user FROM mysql.user;
    SELECT user, max_user_connections FROM mysql.user;
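If the check turned up an account with no connection cap, one possible remediation is a per-account limit. This is only a hedged sketch: 'appuser'@'localhost' and the value 10 are placeholders (site policy), not values taken from the STIG text (MariaDB 10.2+ syntax):

    ALTER USER 'appuser'@'localhost' WITH MAX_USER_CONNECTIONS 10;
    -- verify the change took effect
    SELECT user, host, max_user_connections FROM mysql.user WHERE user = 'appuser';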

V-253677: Shutdown on Audit Failure

Question

  • What is the issue?

Answer

  • MariaDB must shut down or alert on audit failures like lack of space.

Question

  • What is the fix?

Answer

  • Configure alerting when log space is low using the OS or DB logging tools.
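One host-side way to implement the alerting is a small cron job that watches the filesystem holding the audit log. This is only a sketch; the log path, threshold, and alert channel are all assumptions:

    #!/usr/bin/env bash
    # warn when the filesystem holding the MariaDB audit log is nearly full
    LOGDIR=/var/log/mariadb            # assumed audit log location
    USED=$(df --output=pcent "$LOGDIR" | tail -1 | tr -dc '0-9')
    if [ "$USED" -ge 90 ]; then
        # send to syslog; a mail or pager hook could go here instead
        logger -p authpriv.alert "MariaDB audit log filesystem at ${USED}% capacity"
    fi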

Question

  • What type of control is this?

Answer

  • Technical, detective control.

Question

  • Is it set on your system?

Answer

  • Not by default. Logging rollover may need to be configured manually.

V-253678: FIFO Audit Logging

Question

  • What is the issue?

Answer

  • MariaDB must overwrite oldest audit logs when storage is full (FIFO).

Question

  • What is the fix?

Answer

  • Use syslog or configure log rotation/purging. Example configuration:
    [mariadb]
    server_audit_output_type = 'syslog'
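If file output is kept instead of syslog, rotation could be handed to logrotate. A hedged sketch, assuming a default audit log path:

    /var/log/mariadb/server_audit.log {
        # copytruncate rotates in place, since the plugin keeps the file open
        size 100M
        rotate 5
        compress
        missingok
        copytruncate
    }

The audit plugin also has its own server_audit_file_rotate_size and server_audit_file_rotations variables, which achieve the same FIFO behavior natively.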

Question

  • What type of control is this?

Answer

  • Technical, detective control.

Question

  • Is it set on your system?

Answer

  • TODO

V-253754: Audit on Security Object Change

Question

  • What is the issue?

Answer

  • Changes to roles, privileges, and security objects must be audited.

Question

  • What is the fix?

Answer

  • Configure MariaDB Audit Plugin with the following:
    DELETE FROM mysql.server_audit_filters WHERE filtername = 'default';
    INSERT INTO mysql.server_audit_filters (filtername, rule)
    VALUES ('default',
       JSON_COMPACT(
          '{
             "connect_event": ["CONNECT", "DISCONNECT"],
             "query_event": ["ALL"]
          }'
       ));

Question

  • What type of control is this?

Answer

  • Technical, detective control.

Question

  • Is it set on your system?

Answer

  • Not by default, but was configured on the remote server.

My 2025 Learning Plan

Where to start

I am a generally curious person who enjoys learning new things. However, the sheer volume of information available can lead to feeling overwhelmed, distracted, and aimless. From years of experience as an autodidact, I have honed my craft of self-directed study. Now, I create a solid learning plan that keeps me on track and feeling a sense of achievement. A learning plan is a personal roadmap that outlines what to learn, how to learn it, and when to reach certain milestones. I start with goals and work my way backwards from there.

2025

To state my purpose simply, 2025 will be the year I:

  • Get advanced with Go programming by writing several useful programs.
  • Learn Kubernetes and prove that knowledge with a certification.
  • Get comfy with Ansible.
  • Get JLPT N5 certified in the Japanese language.

Footnotes

Embedded Rust? 🦀

Yes

Rust on embedded systems is a different challenge, as it does not use the standard library, and memory safety is not enabled by default. However, it still offers several advantages over popular languages like C or C++. One key benefit is the Hardware Abstraction Layer (HAL), which separates hardware-specific details from the code, enabling more portable software that can compile across multiple architectures. Additionally, Cargo enhances development ergonomics by simplifying project and dependency management. Lastly, Rust's unified build system ensures consistent behavior across platforms, allowing code to compile seamlessly on Windows, macOS, and Linux.

HAL 🪢

HAL is really interesting: it is the concept of mapping hardware and storing that map for API access. HAL leverages Peripheral Access Crates (PACs), which are auto-generated Rust crates representing the registers and bitfields of a microcontroller. PACs allow safe and direct access to hardware registers while ensuring Rust's strict type-checking and ownership rules are followed. HAL sits on top of PACs, abstracting these low-level details. Rust embedded HALs adhere to the embedded-hal traits—a collection of interfaces defining common operations like GPIO pin control, SPI/I2C communication, timers, and ADC usage. By implementing these traits, HAL provides a uniform way to interact with hardware, regardless of the underlying platform. HAL abstracts device-specific features into a user-friendly API. For example:

  • Configuring a GPIO pin involves selecting its mode (input, output, pull-up, etc.) without directly modifying hardware registers.
  • Communication protocols like SPI or I2C are exposed through easy-to-use Rust methods (read, write, transfer, etc.).

Cargo 📦

Cargo handles dependencies seamlessly using Cargo.toml. Developers specify libraries (called “crates”) with version constraints, and Cargo fetches and builds them automatically. Cargo:

  • Ensures reproducible builds by generating a Cargo.lock file that locks dependency versions.
  • Taps into a community-driven ecosystem (e.g., crates.io), simplifying the discovery and use of high-quality, maintained libraries.

Managing Dependencies ⚙️ with Cargo.toml

[dependencies]
embedded-hal = "0.2.6"
stm32f4xx-hal = "0.14"

Cross-compilation support is integrated via targets 🎯

cargo build --target thumbv7em-none-eabihf
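Before that will build, the cross-compilation target has to be installed in the toolchain; with rustup that is a single standard command:

    rustup target add thumbv7em-none-eabihf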

Enforced project Structure 👮‍♂️

my_project/
├── Cargo.toml       # Dependencies & configuration
└── src/
    └── main.rs      # Application entry point

Cross Platform 💻 💼

  • Tools like probe-rs allow seamless debugging and flashing of embedded devices on multiple platforms (Linux, macOS, Windows).
  • The cargo ecosystem integrates testing, building, and dependency management across platforms without additional tools.

Experienced Embedded Devs are switching

I was first made aware of Rust for embedded by experienced devs on YouTube who proclaimed their love of Rust over C for professional projects. Channels such as these say they have switched over and are not looking back:

  • https://www.youtube.com/@therustybits
  • https://www.youtube.com/@JaJakubYT
  • https://www.youtube.com/@floodplainnl
  • https://www.youtube.com/@embedded-rust

Personal Development

I’m writing this post as I plan to develop hardware projects using Rust for embedded systems. The combination of Rust and RISC-V microcontrollers is a particularly exciting intersection. In my sights are the ESP32-C3 and Raspberry Pi Pico 2, both of which I’m considering for upcoming projects. Instead of dealing with messy C, slow MicroPython, or the limitations of TinyGo, Rust allows me to create clean and performant projects—something I always strive for. Stay tuned for more updates!

ProLUG Talk: Kubernetes in the Enterprise

Kubernetes in Enterprise

As a dedicated member of the Professional Linux User Group, I gain valuable insights into essential industry tools, processes, and procedures from professional engineers who work hands-on with major infrastructure.

This evening, Michael Pesa of Lambda Labs delivered an excellent talk on best practices with Kubernetes and GitOps, shedding light on the challenges faced by traditional orchestration approaches. What intrigued me most was the discussion on Talos OS and Chainguard, particularly their use of Software Bill of Materials (SBOM). The concept centers around stripping systems down to their bare essentials, which not only reduces vulnerabilities but also improves performance.

Talos OS is particularly fascinating because it eliminates many traditional system components like SSH, systemd, glibc, package managers, or a shell. Essentially, Talos is just the Linux kernel with several Go binaries. This streamlined approach significantly reduces vulnerabilities and minimizes the attack surface. As Michael mentioned in his presentation, many vulnerabilities stem from privilege escalation, container escapes, and memory hacking. Talos mitigates most of these threats by enforcing API-driven controls instead of relying on a shell and by utilizing private key-based authentication throughout.

I am excited to experiment with these tools in my homelab, where I aim to create a modern, declarative infrastructure with ephemerality at its core.


📝 Notes from the Presentation:

Topic Covered

Immutable operating systems

Minimalist container images

GitOps strategies

Reproducible builds

SUSE MicroOS 🦎

Purpose:

  • Designed as a container host OS.

Features:

  • Read-only root filesystem for enhanced security.
  • Transactional updates managed via Btrfs snapshots.
  • Cloud environment integration with cloud-init; uses ignition elsewhere.

Usage:

  • Ideal for minimal, secure containerized workloads.

Talos Linux 🦅

Overview:

  • Minimalist design focusing on immutability.
  • Linux kernel paired with five Go binaries.
  • Lacks traditional components: no SSH, systemd, glibc, package manager, or shell.

Security:

  • Secure by default with a micro TLS stack and key-based authentication.
  • API-driven operations using YAML, akin to Kubernetes manifests.
  • Uses its own PKI infrastructure, creating a small attack surface.

Tools:

  • talosctl CLI for management.
  • Debugging occurs via ephemeral tools and remote APIs.

Considerations:

  • Best suited for Kubernetes clusters.
  • For bare-metal installations, use Matchbox or a PXE boot system.

Alternatives:

Flatcar and CoreOS are earlier container-focused OS derivatives.

Minimal Container Images 👝

Philosophy:

  • Containers are inherently immutable; minimal images reduce attack surfaces and improve performance.

Best Practices:

  • Use the smallest possible image, minimizing unnecessary dependencies.
  • Employ multi-stage builds for languages with heavy dependencies, separating build and runtime environments.
  • For statically compiled languages like Go, containers can often be reduced to a single binary.

Security:

  • Avoid including shells to reduce exploit vectors.
  • Favor minimal base images like Alpine or tools like Google’s Distroless and Chainguard’s SBOM-integrated containers.

Supply Chain Security 🔗🔐

SBOM (Software Bill of Materials):

  • A detailed list of libraries, packages, versions, and licenses within a container image.
  • Machine-readable format for automated security checks.

Software Attestation:

  • Authenticated metadata about an artifact (e.g., a container image).
  • Enables verification of the artifact’s integrity.

Chainguard:

  • Offers streamlined, dependency-minimized packages.
  • Main drawback: constant updates unless on a paid plan.

Challenges in Immutable Environments ♻️

Limitations:

  • Lack of SSH access and debugging tools on nodes and pods.
  • Ephemeral, short-lived infrastructure and read-only filesystems.

Risks:

  • IaC misconfigurations can cause widespread outages.

Mitigation:

  • Implement proper GitOps workflows.
  • Ensure consistency between development, staging, and production environments.

Observability Strategies in Immutable Environments 👀

Approaches:

  • Centralized logging for better insights.
  • Monitor clusters and services rather than individual pods or nodes.
  • Use ephemeral debugging pods or sidecars with shared access.
  • Declarative observability configurations (e.g., using Kubernetes CRDs).

Principle:

  • Treat nodes and pods as “cattle, not pets” to maintain scalability and consistency.

Key Principles of GitOps and Declarative Infrastructure 😰

GitOps Core Tenets:

  • Git as the single source of truth.
  • Automated reconciliation with tools like ArgoCD or Flux.
  • Infrastructure changes are auditable and reversible.
  • Prevents unauthorized and ad-hoc changes.

Homelab Use:

  • Start with a Talos system in Docker/Podman for experimentation.

GitOps Challenges 😰

Common Issues:

  • Subtle configuration drift across environments.
  • Risk of manual changes not being committed back to Git.
  • Managing rollouts from dev to staging to production.
  • Scaling to hundreds of clusters.
  • Variations in deployment strategies across teams.
  • Securely managing secrets outside Git (e.g., Kubernetes Secrets, HashiCorp Vault).

The GitFlow Workflow 💨

Workflow Overview:

  • Developers create long-lived feature branches.
  • Separate primary branches for development, hotfixes, and releases.
  • The trunk (dev) branch isn’t always stable or deployable.
  • Multi-environment deployments often use separate release branches.

Complexity:

  • Complex merging strategies, especially for hotfixes.
  • Tools like ArgoCD or Flux can automate deployments.

Trunk-Based Development 🐘

Overview:

  • Continuous integration with short-lived feature branches.
  • Frequent commits directly to the main branch.
  • Emphasizes small, incremental changes to reduce merge conflicts.

Advantages:

  • Simplifies CI/CD pipelines.
  • Encourages fast feedback and rapid delivery.

Tools:

  • Works seamlessly with automated systems like Kubernetes or Terraform.

Discord: https://discord.com/invite/m6VPPD9usw Youtube: https://www.youtube.com/@het_tanis8213 Twitch: https://www.twitch.tv/het_tanis ProLUG Book: https://leanpub.com/theprolugbigbookoflabs KillerCoda: https://killercoda.com/het-tanis

Why K8S matters

Kubernetes is Important 🌐🐳

Despite the jokes or criticisms you may have heard, Kubernetes matters a lot. Once I understood the “Why,” I became much more motivated to learn the “How.” This is how I got started with Kubernetes using my Proxmox Home-Lab and K3S.

Firstly, I would like to illuminate the “Why,” as it’s an important philosophy to grasp before diving in. I firmly believe in understanding the “Why” before the “How.” 🧠

A Brief History of Infrastructure 🕰️

The Evolution 💾

Many moons ago 🌛, internet infrastructure relied on operating systems to run services. Unix and Linux were preferred because they are multi-user environments suitable for serving files to requesters. In fact, TCP/IP gained much of its early traction through its implementation in BSD Unix. 🖥️ This model worked for decades: more users meant more machines. However, issues arose. Machines would go down, causing cascading effects. ⚠️ Misconfigurations wreaked havoc, and machines often ran inefficiently, either wasting resources or straining hardware. 🔧

In Comes Virtualization 💻✨

Virtualization revolutionized infrastructure by allowing computers to be divided into independent virtual machines (VMs). A VM is a fully self-contained operating system. This innovation enabled more efficient utilization of hardware, increased flexibility, and reduced downtime caused by hardware failures. 🚀

Containers Changed the Game 🐳📦

Containers brought another layer of efficiency and standardization. Unlike VMs, containers share the host operating system’s kernel but encapsulate applications and their dependencies. This reduces overhead and enables applications to run consistently across different environments. 🌍 Developers could now “build once, run anywhere,” making containers a key tool in modern infrastructure. 🛠️

Orchestration Was Needed 🤖🎛️

As the use of containers exploded, managing them became increasingly complex. Deploying, scaling, monitoring, and maintaining hundreds or thousands of containers manually was impractical. This is where orchestration tools like Kubernetes stepped in, automating these tasks and ensuring applications are always running, balanced, and recoverable in case of failures. ✅

The Why 🤔

Understanding the “Why” is key to appreciating Kubernetes’ value. Here are some of the core reasons:

  • Monitoring 📊: Kubernetes provides tools and integrations to monitor your workloads, ensuring you can observe application health and performance in real time.
  • Logging 📝: Centralized logging in Kubernetes makes it easy to trace and debug issues across distributed systems.
  • Security 🔒: Kubernetes enhances security through role-based access control (RBAC), network policies, and automatic updates, reducing vulnerabilities in production systems.
  • Ephemerality 🌀: Kubernetes embraces the concept of ephemeral workloads, where containers can be replaced automatically if they fail, ensuring high availability.
  • Reproducibility 🔄: Kubernetes enables reproducible deployments by using declarative configurations, allowing you to deploy the same infrastructure consistently across environments.

By addressing these challenges, Kubernetes transforms the way infrastructure is managed and applications are deployed, making it a cornerstone of modern cloud-native computing. ☁️

During Week 10 of the ProLUG Course 1 🛠️📚

John Champine[^2], an OpenShift Engineer, delivered a compelling two-hour presentation on Kubernetes and OpenShift. His anecdotes and technical insights were especially engaging, offering both rich historical context drawn from his personal experience and intricate details about shared resource management. 🖥️⚙️

My Experience 💻🔧

I heavily utilize Proxmox VE to build out simulated production environments where I can practice various administrative and engineering tasks. In my homelab, I installed K3S and Talos to create a typical dev/testing/production environment. 🌐 One particularly unique workflow I used involved building custom Podman containers—yes, Podman! 🐋

It’s not widely known that podman generate kube[^3] can create a manifest for use in Kubernetes pods. With this method, I could prototype and build out containers, functionally test them, and then publish them for declarative deployment. This approach felt incredibly slick to me, as bugs were ironed out during the process, and the final deployment was straightforward. ✅🛠️
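A minimal example of that workflow (the image and names are illustrative, not from my actual projects):

    # prototype and functionally test a container locally
    podman run -d --name web -p 8080:80 docker.io/library/nginx
    # emit a Kubernetes manifest describing it, ready for declarative deployment
    podman generate kube web > web-pod.yaml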

Most of my work was done with K3S[^4], a Rancher[^5]-based distro designed for low-resource environments. However, I also experimented with Talos OS, setting up multiple virtual machines in a configuration resembling a multi-machine/node environment—with a sprinkle of jank to keep things interesting. 🤖✨

This hands-on approach allowed me to deepen my understanding of Kubernetes while also refining workflows that integrate containerization and orchestration. 🚀

Respect 🙌

From these experiences, I have developed a deep respect for Kubernetes. I see it as the operating system of the internet—an innovation that will inspire other similar systems. 🌐 While newer technologies like MicroVMs and hybrid container/VM architectures are emerging, I believe they can be easily incorporated into orchestration schemes like Kubernetes. 🤖🛠️

Given this perspective, I think Kubernetes will remain relevant for a long time, much like Unix/Linux. 🐧 It simply makes sense given the strenuous demands of the modern internet and the ever-growing number of attacks and incidents. Such a resilient system enables greater efficiency, enhanced security, and improved situational awareness for everyone. 🔒🚀

Footnotes


  1. The above quote is excerpted from an earlier blog post. techblog, 2024. ↩︎

  2. John Champine Profile. LinkedIn, 2024. ↩︎

  3. Podman Export Docs. Podman Docs, 2019. ↩︎

  4. K3S Website. Site, 2025. ↩︎

  5. Rancher Website. Site, 2025. ↩︎

ProLUG Admin Course Unit 16 🐧

Incident Response

Incident response is a structured approach to identifying, managing, and resolving unexpected events such as security breaches, system failures, or misconfigurations. It aims to minimize disruption, mitigate damage, and restore normal operations while implementing lessons learned to prevent future incidents.

Responding to incidents is a stressful event because it can involve many stakeholders and little time. This week we exercised our skills by live debugging in front of our peers on a remote host. The problems all related to failure modes and misconfiguration and the exercise was rewarding in that I learned a lot as always, and built some confidence.


Incident Response / Troubleshooting Scenarios 🧑‍💻

Scenario #1: Web Server Not Running 🕸️

Objective: Ensure the web server is running and responding on port 80.
Steps:

  • Verify web server service:
    • Run: systemctl enable --now httpd (or similar command).
  • Check open ports:
    • Run: ss -ntulp.
    • Identify if the server is running on port 8087 instead of 80.
    • Edit the configuration:
      • File: /etc/httpd/conf/httpd.conf.
      • Change Listen 8087 to Listen 80.
    • Restart the service: systemctl restart httpd.
  • Ensure external connectivity:
    • Check the firewall status: systemctl status firewalld.
    • Options:
      • Disable the firewall: systemctl stop firewalld.
      • Open port 80 if needed (see the firewalld sketch after this list).
  • Final step:
    • REBOOT the lab machine.
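For the firewall step above, opening the port rather than disabling firewalld would look like this:

    firewall-cmd --permanent --add-port=80/tcp    # or: --add-service=http
    firewall-cmd --reload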

Scenario #2: Mount Point /space Not Working 💾

Objective: Set up a 9GB partition on the /space mount point using LVM.
Steps:

  1. Verify /space setup:
    • Confirm the partition is not properly set up.
  2. Create LVM setup:
    • Identify disks: fdisk -l | grep -i xvd.
    • Create physical volumes: pvcreate /dev/xvd<disk>.
    • Create a volume group:
      • Run: vgcreate space /dev/xvd<disk1> /dev/xvd<disk2> /dev/xvd<disk3>.
    • Create a logical volume:
      • Run: lvcreate -n space -l +100%FREE space.
  3. Format the logical volume:
    • Create a filesystem: mkfs.ext4 /dev/mapper/<logical_volume_name>.
  4. Mount the filesystem:
    • Create the mount point: mkdir /space.
    • Add an entry in /etc/fstab:
      /dev/mapper/<logical_volume_name> /space <ext4 or xfs> defaults 1 2
    • Mount the filesystem: mount -a. (A verification sketch follows this scenario.)
  5. Final step:
    • REBOOT the lab machine.
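After the reboot, a quick sanity check that the work held. The mapper name here assumes VG "space" and LV "space", per the steps above:

    lsblk /dev/mapper/space-space
    df -h /space
    grep /space /etc/fstab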

Scenario #3: System Not Updating 📦

Objective: Fix the system to allow updates via dnf and ensure kernel updates.
Steps:

  1. Fix DNF repository configuration:
    • Edit /etc/yum.repos.d/rocky.repo:
      • Set enabled=1 for all necessary repositories.
    • Check /etc/yum.repos.d/rocky.repo.orig for reference.
    • Fix the EPEL repository the same way.
  2. Verify kernel updates:
    • Edit /etc/yum.conf:
      • Comment out the line: exclude=kernel*. (A scripted version of these edits follows below.)
  3. Final step:
    • REBOOT the lab machine if necessary.
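The edits in steps 1 and 2 can be scripted. A sketch only; back up first, and the exact repo file and option names may differ on a given build:

    cp /etc/yum.repos.d/rocky.repo /etc/yum.repos.d/rocky.repo.bak
    sed -i 's/^enabled=0/enabled=1/' /etc/yum.repos.d/rocky.repo
    sed -i 's/^exclude=kernel\*/#&/' /etc/yum.conf
    dnf makecache && dnf check-update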

Discord: https://discord.com/invite/m6VPPD9usw Youtube: https://www.youtube.com/@het_tanis8213 Twitch: https://www.twitch.tv/het_tanis ProLUG Book: https://leanpub.com/theprolugbigbookoflabs KillerCoda: https://killercoda.com/het-tanis

Wrapping up ProLUG PAC

The Wrap 📦

Yesterday, our close-knit group completed an intensive 16-week hands-on course in Enterprise Linux Administration, culminating in a live incident response session.

Over the course of hundreds of hours, I pushed myself to go above and beyond in my studies and responsibilities. Along the way, I formed strong connections with like-minded peers, navigating the modern educational landscape of YouTube, Discord, and KillerCoda.

I am truly grateful to have stumbled upon this seemingly random community and to have experienced the structured, effective teaching methods of Scott Champine (Het Tanis), an experienced and traditional educator.

Discord 💬

To understand the environment, we must first understand the platform. Discord is a communication platform that combines text, voice, and video chat, designed to create communities where people can interact in real-time. What makes it unique is its seamless integration of customizable servers, topic-specific channels, and robust tools for both casual conversation and collaborative work.

Working on Discord fostered a comfortable sense of passive interaction. Unlike, say, Zoom, Skype, or any other similar video communication platform, Discord allows people to come and go as they please, have multiple presenters, and hold open voice chat, replicating a real-world meeting more closely.

This allowed for impromptu discussions and presentations, greatly improving the learning experience.

Study group 🏘️

Early in the course, I applied my leadership skills by organizing a formal schedule for our study group meetings. These sessions covered course assignments in detail while also exploring related topics through collaborative, interactive projects.

The format was casual and engaging. I would share my screen to walk through scenarios while the group discussed the subject matter. Others also shared their screens, demonstrating tips and tricks in unison.

One of the most effective tools I introduced was a shared note using Etherpad. Similar to Google Docs, Etherpad allows multiple people to edit a document simultaneously. However, it stands out by enabling access without requiring sign-in credentials, making it easy to share with anyone.

These activities relied heavily on trust, as it would have been easy for someone to disrupt the sessions. My leadership skills were frequently tested by off-topic individuals or disruptive participants, but such issues were usually short-lived.

Gains 💪🏻

Coming into the course, I already had a solid understanding of Linux, backed by a few years of experience. Additionally, I had completed RWXRob’s (Rob Muhlestein’s) Beginner Boost DevOps course a year prior.

What set this course apart was its group-learning dynamic. During Rob’s course, I worked alone, building projects and debugging through hard-fought, self-directed methods like reading documentation, brute-forcing solutions, and referencing forums. In contrast, group work brought added motivation, inspiration, and a collaborative approach to problem-solving. It helped eliminate mundane, off-topic roadblocks, allowing us to focus on core learning and progress more efficiently.

Connections ⛓️

Through the study group and community discussions, I’ve developed a strong connection with the ProLUG community and feel confident that I can rely on the server for discussions, questions, and troubleshooting. In the near future, I plan to give back by supporting future coursework and helping new learners navigate their journey.

Thankful 🙏

I’m deeply grateful to Scott Champine (Het Tanis) for offering this free course and dedicating so much of his time to it. I’m equally thankful to the server members who joined the study group and dove headfirst into the intricacies of systems.

Until next time! ✌️


Discord: https://discord.com/invite/m6VPPD9usw Youtube: https://www.youtube.com/@het_tanis8213 Twitch: https://www.twitch.tv/het_tanis ProLUG Book: https://leanpub.com/theprolugbigbookoflabs KillerCoda: https://killercoda.com/het-tanis

ProLUG Admin Course Unit 15 🐧

Engineering Troubleshooting

Systems engineering troubleshooting involves diagnosing and resolving complex issues within interconnected systems to ensure seamless operation and optimal performance. It requires a methodical approach to identify root causes, integrate solutions, and maintain system functionality while addressing both technical and process-related challenges.


Discussion Post 1:

Your management is all fired up about implementing some Six Sigma processes around the company. You decide to familiarize yourself and get some basic understanding to pass along to your team.

Quoted from the book

5S is a Japanese Lean approach to organizing a workspace, so that by making a process more effective and efficient, it will become easier to identify and expunge muda. 5S relies on visual cues and a clean work area to enhance efficiencies, reduce accidents, and standardize workflows to reduce defects. The method is based on five steps:

  • Sort (Seiri)
  • Straighten (Seiton)
  • Shine (Seiso)
  • Standardize (Seiketsu)
  • Sustain (Shitsuke)

What about the “5S” methodology might help us as a team of system administrators?

  • Identify and categorize common troubleshooting problems such as typos, illogical configurations, or vulnerabilities to establish clarity and prioritize issues. (Seiri)
  • Organize and catalog processes and procedures for addressing both routine and uncommon problem scenarios for quick access and consistency. (Seiton)
  • Validate and test all processes and procedures to ensure they function effectively and reliably. (Seiso)
  • Promote team familiarity by regularly practicing and drilling procedures, similar to incident response training, to build confidence and efficiency. (Seiketsu)
  • Apply the processes and procedures in real-world scenarios, evaluate their effectiveness, make necessary adjustments, and document improvements for future use. (Shitsuke)

By applying the 5S methodology to troubleshooting, the team can develop a shared understanding of how to consistently address issues, identify system failure points, and create opportunities for incremental improvement, fostering a sense of flow and efficiency.

What are the four layers of process definition? How would you explain them to your junior engineers?

  • Inputs: anything entering the process, or required to enter it, that drives the creation of an output.
  • Outputs: the service or product created by the process.
  • Events: predefined criteria or actions that cause a process to begin working.
  • Tasks: the heart of the process; a unit of action within it. Decisions may be made during or for tasks.

The four layers of a process—inputs, outputs, events, and tasks—function similarly to how a computer program uses functions to process variables and produce results. However, in a Six Sigma process, these elements are more dynamic, encompassing both virtual and physical components, as well as steps that may be driven by either human actions or automated systems. This flexibility allows Six Sigma to address complex workflows that combine diverse inputs and tasks to achieve consistent and efficient outputs.

Looking at our operation as a series of processes with layers like this can help us to identify, refine, and standardize processes into Standard Operating Procedures (SOPs).


Discussion Post 2: Your team looks at a lot of visual data. You decide to write up an explanation for them to explain what they look at.

What is a high water mark? Why might it be good to know in utilization of systems?

The phrase “high water mark” originates from marking a riverbank to indicate the highest water level reached during a season. This mark serves as a warning, signaling potential danger if the water rises beyond it in the future. In the context of systems, the high water mark represents historically safe operational loads. If metrics indicate that this threshold has been exceeded, it should alert administrators to a potential issue. For example, if the high water mark for daily memory usage was 14/18 GiB of RAM, and we observe the system suddenly using 16/18 GiB, this warrants attention as a potential problem requiring further investigation.

What is an upper and lower control limit in a system output? While this isn’t exactly what we’re looking at, why might it be good to explain to your junior engineers?

Control limits are essential tools for monitoring the stability of a process over time. The upper and lower control limits define the normal range within which a process output should remain when the process is operating correctly. If the output exceeds these boundaries, it signals a potential issue, indicating that the process may be out of control and requiring investigation or troubleshooting.


Definitions/Terminology

  • Incident: An unplanned event that disrupts normal operations.
  • Problem: The underlying cause of one or more incidents.
  • FMEA (Failure Mode and Effects Analysis): A method for identifying and prioritizing potential failure points in a process.
  • Six Sigma: A data-driven methodology focused on improving processes by reducing defects and variability.
  • TQM (Total Quality Management): A management approach emphasizing continuous improvement and customer satisfaction across all organizational processes.
  • Post Mortem: A retrospective analysis of an event to identify successes and areas for improvement.
  • Scientific Method: A systematic process of forming hypotheses, testing them, and analyzing results to draw conclusions.
  • Iterative: A repetitive approach to refining a process or solution through successive cycles.
  • Discrete Data: Data that can only take specific, distinct values.
  • Ordinal: Data with a meaningful order but no consistent interval (e.g., satisfaction ratings).
  • Nominal (Binary/Attribute): Categorical data without order, such as “yes/no” or “male/female.”
  • Continuous Data: Data that can take any value within a range, such as temperature or time.
  • Risk Priority Number (RPN): A score in FMEA used to prioritize risks, calculated as Severity × Occurrence × Detection.
  • 5 Whys: A technique to identify the root cause of a problem by repeatedly asking “why” until the root cause is found.
  • Fishbone Diagram (Ishikawa): A visual tool used to identify and categorize potential causes of a problem.
  • Fault Tree Analysis (FTA): A deductive analysis method used to identify the root causes of system failures.
  • PDCA (Plan-Do-Check-Act): A cyclical process for continuous improvement in workflows or systems.
  • SIPOC: A high-level process map identifying Suppliers, Inputs, Processes, Outputs, and Customers.

Discord: https://discord.com/invite/m6VPPD9usw Youtube: https://www.youtube.com/@het_tanis8213 Twitch: https://www.twitch.tv/het_tanis ProLUG Book: https://leanpub.com/theprolugbigbookoflabs KillerCoda: https://killercoda.com/het-tanis

ProLUG Admin Course Unit 14 🐧

Ansible Automation

Ansible is an open-source automation tool used for configuration management, application deployment, and IT orchestration, enabling tasks to be executed on multiple systems simultaneously without the need for agents. It uses simple YAML-based playbooks and SSH for communication, making it efficient and easy to learn for managing infrastructure.


Discussion Post 1:

Refer to your Unit 5 scan of the systems. You know that Ansible is a tool that you want to maintain in the environment. Review this online documentation: https://docs.ansible.com/ansible/latest/inventory_guide/intro_inventory.html

Make an inventory of the servers, grouped any way you like.

[warewulf]
192.168.200.25

[ubuntu]
192.168.200.[101:103]
192.168.200.[201:203]

[rockynodes]
192.168.200.[51:69]
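A quick way to confirm Ansible parses the groups as intended is the ansible-inventory command:

    ansible-inventory -i hosts --graph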

What format did you choose to use for your inventory?

INI. YAML seems to be the clear choice as it allows for a more declarative inventory. However, while working in the study group it was easier to edit INI without making indentation errors.

What other things might you include later in your inventory to make it more useful?

I can add quite a few interesting things to an inventory file to make it more useful. Once the file reaches a certain size, it is better to break it out into separate unit files that are nested for things like Hosts, Host Variables, Production, Staging etc…

Some notable things I think make things pretty powerful are:

  • Dynamic Inventorying
  • Vault encrypted secrets
  • Jinja2 Dynamic variables
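For instance, host and group variables can live right in the INI file. A small sketch; the variable names and values here are illustrative, not from my actual inventory:

    [rockynodes:vars]
    ansible_user=admin
    ntp_server=192.168.200.1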

Discussion Post 2:

You have been noticing drift on your server configurations, so you want a way to generate a report on them every day to validate the configurations are the same.


- name: System Information Check
  hosts: all
  become: yes
  tasks:
    - name: Check kernel version
      command: uname -r
      register: kernel_version

    - name: Display kernel version
      debug:
        msg: "Kernel Version: {{ kernel_version.stdout }}"

    - name: Display kernel command line
      command: cat /proc/cmdline
      register: kernel_cmdline

    - name: Debug kernel command line
      debug:
        msg: "Kernel Command Line: {{ kernel_cmdline.stdout }}"

    - name: Check hardware information
      command: lshw
      register: hardware_info

    - name: Debug hardware information
      debug:
        msg: "Hardware Information: {{ hardware_info.stdout }}"

    - name: List installed RPM packages
      command: rpm -qa
      register: installed_rpms

    - name: Debug installed RPM packages
      debug:
        msg: "Installed RPMs: {{ installed_rpms.stdout }}"

    - name: Check active services
      command: systemctl list-units --type=service --state=running
      register: active_services

    - name: Debug active services
      debug:
        msg: "Active Services: {{ active_services.stdout }}"

    - name: List system users
      command: cat /etc/passwd
      register: system_users

    - name: Debug system users
      debug:
        msg: "System Users: {{ system_users.stdout }}"

    - name: Check last login information
      command: last
      register: last_login

    - name: Debug last login information
      debug:
        msg: "Last Login Information: {{ last_login.stdout }}"

    - name: Display currently logged-in users
      command: w
      register: logged_in_users

    - name: Debug currently logged-in users
      debug:
        msg: "Logged-in Users: {{ logged_in_users.stdout }}"

    - name: Display ISO time
      command: date --iso-8601=seconds
      register: iso_time

    - name: Debug ISO time
      debug:
        msg: "ISO Time: {{ iso_time.stdout }}"
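To turn this into the daily drift report, the play could run from cron and append to a dated log for later diffing. A sketch; the playbook filename and paths are placeholders:

    # /etc/cron.d/config-report (sketch)
    0 6 * * * root ansible-playbook -i /root/ansible_madness/hosts sysinfo.yaml >> /var/log/config-report-$(date +\%F).log 2>&1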


Discussion Post 3:

Using ansible module for git, pull down this repo: 👍 https://github.com/het-tanis/HPC_Deploy.git

How is the repo setup?

We have three playbooks in the root of the repo, all of which use roles defined under subdirectories of roles: playbook 01 gathers facts about NFS, playbook 02 gathers data from a target system via the data-gather role, and playbook 03 updates and installs packages via the packages_update/tasks and packages_install/tasks roles.

What is in the roles directory?

Partially answered for the previous question. Roles are structured with dedicated directories for tasks, handlers, and templates.

How are these playbooks called, and how do roles differ from tasks?

These playbooks incorporate roles based on specific conditions, executing the tasks defined within each role. When a role is included, the playbook inherits all its contents.


Digging Deeper

I have a large number of labs to get you started on your Ansible journey (all free):

https://killercoda.com/het-tanis/course/Ansible-Labs 👍

Find projects from our channel Ansible-Code in Discord and find something that is interesting to you. 👍

Use Ansible to access secrets from Hashicorp Vault: https://killercoda.com/het-tanis/course/Hashicorp-Labs/004-vault-read-secrets-ansible 👍


Lab Work 🧪 👍

Create an inventory:

  1. While still in the /root/ansible_madness directory, create a file named hosts:

     vi /root/ansible_madness/hosts

     Populate the file as follows:

[servers]
192.168.200.101
192.168.200.102
192.168.200.103

Run ad hoc commands against your servers. This will test your connection into all the servers.

1. ansible servers -i hosts -u inmate -k -m shell -a uptime

Use this password: LinuxR0cks1!

Do the same thing, but this time be verbose

2. ansible -vvv servers -i hosts -u inmate -k -m shell -a uptime

Create a playbook to push over files.

3. echo "This is my file " > somefile

4. vi deploy.yaml

Populate it as follows:

- name: Start of push playbook
  hosts: servers
  gather_facts: True
  become: False
  tasks:
    - name: Copy somefile over at {{ ansible_date_time.iso8601_basic_short }}
      copy:
        src: /root/ansible_madness/somefile
        dest: /tmp/somefile.txt

5. Execute your playbook

ansible-playbook -i hosts -k deploy.yaml

6. Verify that your file pushed everywhere

ansible servers -i hosts -u inmate -k -m shell -a "ls -l /tmp/somefile.txt"

Pull down a GitHub repo

  1. git clone https://github.com/het-tanis/HPC_Deploy.git
     cd HPC_Deploy

What do you see in here? What do you need to learn more about to deploy some of these tools? Can you execute some of these, why or why not?

Reflection Questions

  1. What questions do you still have about this week?

  2. How can you apply this now in your current role in IT? If you’re not in IT, how can you look to put something like this into your resume or portfolio?


Discord: https://discord.com/invite/m6VPPD9usw Youtube: https://www.youtube.com/@het_tanis8213 Twitch: https://www.twitch.tv/het_tanis ProLUG Book: https://leanpub.com/theprolugbigbookoflabs KillerCoda: https://killercoda.com/het-tanis

ProLUG Admin Course Unit 13 🐧

System Hardening

Linux system hardening involves securing the system by reducing its attack surface through measures such as disabling unnecessary services, enforcing access controls, applying security patches, and using tools like OpenSCAP, STIG compliance frameworks, or the OSCAP Scanner. These tools help automate security audits, enforce compliance standards, and identify vulnerabilities to enhance system security.


Discussion Post 1:

Your security team comes to you with a discrepancy between the production security baseline and something that is running on one of your servers in production. There are 5 servers in a web cluster and only one of them is showing this behavior. They want you to account for why something is different.

How are you going to validate the difference between the systems?

I am going to assume that I am new to the system in general and have only surface-level knowledge from fellow staff. I am also assuming we are working with a Red Hat-based system.

Starting off simple

Maybe the problem is an obvious one, so I would just start off with a glance.

  • Quick cursory check of the kernel version (uname -a)
  • Manually checking logs (journalctl, dmesg, audit.log, syslog)
  • Checking ports (socket statistics: ss -ntulp)
  • Listing installed packages (dnf list installed, rpm -qa)
  • Listing users and logins (/etc/passwd, w, last)
  • Seeing what systemd services are running (systemctl list-units --type=service)
  • Digging for documentation and/or commit history

Deeper ⛏️

If there is no low-hanging fruit, I would check configurations:

  • grub.conf, firewalld, AppArmor, SELinux

Sorting 🪰‘ish from 🌶️

If I don't spot anything distinctly different, I would employ a more sophisticated approach with difference checking. Given that everything is a structured file, I can append the output from a working system and from the goose 🪿 to files and run diff against them (a concrete sketch follows the list below).

  • Diff’ing the Logs
  • Diff’ing Socket Statistics
  • Diff’ing Installed Packages
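A concrete sketch of that diffing approach, with hostnames as placeholders (web1 being a healthy node, web3 being the goose):

    ssh web1 'rpm -qa | sort' > web1.pkgs
    ssh web3 'rpm -qa | sort' > web3.pkgs
    diff web1.pkgs web3.pkgs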

What are you going to look at to explain this?

I think I have answered this above.

What could be done to prevent this problem in the future?

  • Introducing or Improving the change management policy and employing version control would be my first suggestion.
  • Ensuring that there is a build/test/deploy pipeline that integrates tightly with change management.
  • Using IaC and Automation to ensure consistency and repeatability with tools like Ansible, Packer, Podman or Kubernetes.
  • Hardening systems with either simple policies or through the guidance of STIGs.
  • Introducing stronger controls over user privileges, such as RBAC policies.

Discussion Post 2:

Your team has been giving you more and more engineering responsibilities. You are being asked to build out the next set of servers to integrate into the development environment. Your team is going from RHEL 8 to Rocky 9.4.

How might you start to plan out your migration?

Observe

Firstly, I would gather system information:

  • Benchmark/baseline performance metrics and utilization (Disk, I/O, PS, Connections etc)
  • Configs (Scripts and configuration files)
  • Installed Packages.
  • users (Listing users and privileges)
  • Policies (Firewall, SELinux)
  • Purpose (Assessing the use of a particular system to see if it may need changes/upgrades)

Capture

  • I would snapshot the current system if possible
  • If a complete snapshot copy is not possible, I would gather files essential to rebuilding a replica

Reconstruct

  • Build it in a test VM emulating the current environment
  • Template the VM for experimental changes (Adding additional tools or Configs)

Analyze / Optimize

  • Gather business or operational requirements, perhaps the system needs enhancements
  • Experiment with performance tuning
  • Test new packages and/or configurations

Build

During the analysis and optimization phase, I would start a playbook with information gathered from previous phases. I would build and run the playbook against VM templates until satisfied.

Deploy

Given the prior phases, my Playbook would be robust and capable of the transition. However, I would ensure a robust backup and rollback plan in the case something fails.

What are you going to check on the existing systems to baseline your build?

  1. Compute Usage
  2. Memory Load
  3. Disk Resources
  4. Networking Metrics

What kind of validation plan might you use for your new Rocky 9.4 systems?

I would have a separate playbook built that would validate performance against what I was observing during my VM experimentation. Though the environment may differ from that of the VM, I would still be able to discern performance characteristics and notice any outlier differences.


Digging Deeper

  1. Run through this lab: https://killercoda.com/het-tanis/course/Linux-Labs/107-server-startup-process 👍

How does this help you better understand the discussion 13-2 question?

Well, when I am gathering a picture of my current security baseline, I can use some of these tools, like dmesg and ss, to see what possible attack surface I may have.

  1. Run through this lab: https://killercoda.com/het-tanis/course/Linux-Labs/203-updating-golden-image 👍

How does this help you better understand the process of hardening systems?


Reflection Questions

  1. What questions do you still have about this week?

  2. How can you apply this now in your current role in IT? If you’re not in IT, how can you look to put something like this into your resume or portfolio?


Lab Work 🧪

1. You will scan a server for a SCC Report and get a STIG Score 👍

2. You will remediate some of the items from the scan 👍

3. You will rescan and verify a better score. 👍


Discord: https://discord.com/invite/m6VPPD9usw Youtube: https://www.youtube.com/@het_tanis8213 Twitch: https://www.twitch.tv/het_tanis ProLUG Book: https://leanpub.com/theprolugbigbookoflabs KillerCoda: https://killercoda.com/het-tanis