The purpose of a baseline is not to find fault, lay blame, or take corrective action. A baseline simply
determines what is. You must know what is so that you can test against that when you make a change to
be able to objectively say there was or wasn’t an improvement. You must know where you are to be
able to properly plan where you are going. A poor baseline assessment, because of inflated numbers or
inaccurate testing, does a disservice to the rest of your project. You must accurately draw the first line
and understand your system’s performance.
Your manager has come to you with another emergency. He has a meeting next week to discuss capacity planning and usage of the system with IT upper management. He doesn’t want to lose his budget, but he has to prove that the system utilization warrants spending more.
What information can you show your manager from your systems?
You could present your manager with a progressive trend graph showing time on the x-axis and several fields on the y-axis that represent changes from a baseline, assuming the necessary data has been collected. With this information, it would be possible to predict when various system resources will reach their maximum capacity.
What type of data would prove system utilization? (Remember the big 4: compute, memory, disk, networking)
CPU load, process execution time, throughput.
Disk Operations (IOPS).
Networking Requests and Bandwidth.
RAM utilization, memory paging/swapping rates.
Current and projected system utilization. By examining trends over time, we can predict when critical resources will reach their limits if no additional capacity is provisioned.
Below is a sample progressive trend graph over the last 6 months. The x-axis represents time (in weeks), while the y-axis shows percentage utilization relative to an established baseline.
We can estimate the “time to ceiling” for critical resources. For instance, if CPU load is rising at an average rate of 2–3% per month, and we know that the system will begin to degrade at 90% utilization, we can project roughly when that ceiling will be reached.
Projected Time to CPU Ceiling: 3–5 months
Projected Time to Memory Ceiling: 6–8 months
Projected Time to Disk IOPS Ceiling: 8–10 months
Projected Time to Network Bandwidth Ceiling: 4–6 months
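The projections above are simple linear extrapolations. A minimal sketch of the arithmetic, assuming the 90% degradation threshold and using illustrative (not measured) current-utilization and growth figures:

```shell
#!/bin/sh
# Hypothetical sketch: months until a resource hits its ceiling, given a
# current utilization and a linear monthly growth rate. All numbers below
# are illustrative assumptions, not measured values.
project() {
  # $1 = resource name, $2 = current %, $3 = growth %/month, $4 = ceiling %
  awk -v name="$1" -v cur="$2" -v rate="$3" -v max="$4" \
    'BEGIN { printf "%s: ~%.1f months to %d%% ceiling\n", name, (max-cur)/rate, max }'
}

project "CPU"    78 3 90
project "Memory" 62 4 90
```

Real trend lines are rarely perfectly linear, so these figures are only useful as ranges, which is why the projections above are stated as 3–5 months rather than a single number.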
Investment in scaling resources now will prevent future performance bottlenecks, ensuring the system can continue to meet business demands effectively.
You are in a capacity planning meeting with a few of the architects. They have decided to add 2 more agents to your Linux systems: a Bacula agent and an Avamar agent. They expect these agents to run their work starting at 0400 every morning.
What do these agents do? (May have to look them up)
Bacula is an open-source suite of tools designed to automate backup tasks. It’s widely regarded for its flexibility and reliability. Dell Avamar, on the other hand, is a commercial backup automation solution. Both tools handle incremental backups using custom daemons that monitor changes over time, offering greater sophistication than simple scheduling systems like Cron. Additionally, they can manage backups across diverse, heterogeneous storage environments.
Do you think there is a good reason not to use these agents at this timeframe?
This approach is about balancing workload. If all processes start at a fixed time, they can consume valuable resources simultaneously. The best schedule depends on the environment. For example, if the environment experiences downtime—such as a traditional office setting—starting backups at 4 a.m. might be fine. However, if services run around the clock, it’s better to stagger the tasks so they use only a fraction of the available resources at any given time. This approach also reduces the impact of failures, since not all systems are involved at once.
Is there anything else you might want to point out to these architects about these agents they are installing?
There are several factors architects should consider. However, in the context of this discussion, performance overhead is particularly relevant. They need to ensure that the chosen backup solutions won’t overburden the system’s resources and that there’s enough “breathing room” to maintain smooth operations.
Your team has recently tested a proof of concept of a new storage system. The vendor has published the blazing-fast speeds the storage system is supposedly capable of. You have a set of systems connected to both the old storage system and the new storage system.
Write up a test procedure of how you may test these two systems.
I did a bit of research regarding tooling for such a task and found fio (‘Flexible I/O Tester’), a program written for the purpose of testing systems under various scenarios.
Rather than hand-rolling Bash, I can run more comprehensive testing, with more data to analyze, using fio.
I would then pipe the output of these commands to AWK to separate out specific datapoints and append them to files for full analysis.
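A minimal sketch of that pipeline, assuming illustrative fio job parameters and a captured sample output line (the mount path, job options, and the sample figures are assumptions, not real results):

```shell
#!/bin/sh
# Sketch: run fio against each storage system, then pull datapoints out
# with awk. The fio invocation is shown commented out so the parsing step
# can be demonstrated standalone.
#   fio --name=randread --filename=/mnt/new_storage/testfile --rw=randread \
#       --bs=4k --size=1G --numjobs=4 --runtime=60 --group_reporting > new.log

# Extract the IOPS figure from a fio summary line so it can be appended
# to a results file for analysis.
extract_iops() {
  awk -F'[=,]' '/IOPS=/ { print $2; exit }'
}

# Demo against a captured sample line instead of a live fio run:
echo "  read: IOPS=95200, BW=372MiB/s (390MB/s)" | extract_iops
```

Repeating the same job file against the old and new storage, with all other parameters fixed, is what makes the two result sets directly comparable.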
How are you assuring these tests are objective?
By gathering multiple datasets with varying run parameters, I can reduce statistical noise and better isolate data of interest by comparing these datasets against one another.
What is meant by the term Ceteris Paribus, in this context?
Ceteris paribus, in the context of system benchmarking, means that when measuring the performance of one specific aspect of the system, all other variables and conditions are kept constant. This approach ensures that any observed changes in performance can be attributed directly to the variable under test, rather than being influenced by unrelated fluctuations in the environment or system load.
Analyzing data may open up a new field of interest to you. Go through some of the
free lessons on Kaggle, here: https://www.kaggle.com/learn
a. What did you learn?
b. How will you apply these lessons to data and monitoring you have already
collected as a system administrator?
Find a blog or article that discusses the 4 types of data analytics.
a. What did you learn about past operations?
b. What did you learn about predictive operations?
Download Spyder IDE (Open source)
a. Find a blog post or otherwise try to evaluate some data.
b. Perform some Linear regression. My block of code (but this requires some
additional libraries to be added. I can help with that if you need it.)
Create config file
sudo vim /etc/grafana/grafana.ini
Change the default values of ‘http_addr’ (here ‘localhost’), ‘http_port’ (here a non-standard ‘4000’), and the ‘domain’ option (your domain name; for this example, ‘grafana.example.io’). For a non-standard port, be sure to uncomment the line by removing the leading ‘;’.
[server]
http_port = 4000
# The public-facing domain name used to access Grafana from a browser
domain = grafana.example.io
7.1 Turn off the nasty default analytics reporting 👺
[analytics]
reporting_enabled = false
7.2. Restart the Grafana service to apply the new configuration.
sudo systemctl restart grafana-server
Reverse Proxy Setup
Install NGINX
sudo dnf install nginx -y
Create a new server block for grafana
/etc/nginx/conf.d/grafana.conf
WebSocket upgrade headers are required to proxy Grafana Live WebSocket connections.
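A hedged sketch of what that server block could look like, assuming Grafana is listening on localhost:3000 and the example domain from above (adjust both to your environment):

```nginx
# Sketch of /etc/nginx/conf.d/grafana.conf (assumed backend: localhost:3000)
server {
    listen 80;
    server_name grafana.example.io;

    location / {
        proxy_pass http://localhost:3000;
        proxy_set_header Host $host;
    }

    # Required to proxy Grafana Live WebSocket connections
    location /api/live/ {
        proxy_pass http://localhost:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
        proxy_set_header Host $host;
    }
}
```

After writing the file, `nginx -t` will validate the syntax before a reload.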
Note:
All Prometheus configuration lives in the ‘/etc/prometheus’ directory, and all Prometheus data will automatically be saved to the directory ‘/var/lib/prometheus’.
Installing Prometheus on Rocky Linux
Install Prometheus monitoring system manually from the tarball or tar.gz file.
Change the working directory
to ‘/usr/src’ and download the Prometheus binary
On the ‘scrape_configs’ option, you may need to add monitoring jobs
The default configuration comes with the default monitoring job name ‘prometheus’ and the target server ’localhost’ through the ‘static_configs’ option.
Change the target from ’localhost:9090’ to the server IP address ‘192.168.1.10:9090’ as below.
Note:
Scrape configuration containing exactly one endpoint to scrape; here it’s Prometheus itself.
scrape_configs:
  # The job name is added as a label job=<job_name> to any timeseries scraped from this config.
  - job_name: "prometheus"
    # metrics_path defaults to '/metrics' and scheme defaults to 'http'.
    static_configs:
      - targets: ["192.168.1.10:9090"]
Change the ownership of the configuration and data directories to the user ‘prometheus’.
In this unit, we explore monitoring systems, which often consist of multiple interconnected components. At its core, monitoring involves carefully exposing system data and transmitting it to tools for analysis and alerting. From my experience with Prometheus and Grafana—two widely used and versatile solutions—I’ve seen how effective these tools can be for various scenarios. However, many other tools are also available. One of my key takeaways from the unit’s readings and labs was the importance of careful data exposure. Much like setting permissions in a Linux system, it’s crucial to determine what data can be accessed and who is allowed to see it in the reporting chain. System information, if mishandled, can easily become a double-edged sword.
You’ve heard the term “loose coupling” thrown around the
office about a new monitoring solution coming down the pike. You find a good resource and
read the section on “Prefer Loose Coupling” https://sre.google/workbook/monitoring/
Loose coupling means the components of a system can operate independently, yet still work together when combined. This design allows individual components to be swapped or replaced with minimal disruption to the overall system. In contrast, a strongly coupled system binds its components so tightly that altering one would disrupt or even break the entire system’s functionality.
The advantage of a loosely coupled monitoring system lies in its flexibility to evolve over time. Systems change, requirements shift, and new tools emerge, making adaptability essential. A design that allows components to be replaced or upgraded with minimal disruption is highly valuable—not only for an organization aiming to maintain efficiency but also for the administrators and engineers responsible for ensuring stability and resolving issues.
Exposing metrics involves making system information accessible for monitoring and analysis. However, this must be approached with caution, as exposing such information can introduce vulnerabilities. Simply exposing data without actively collecting or utilizing it needlessly increases security risks without providing any benefit.
Your HPC team is asking for more information about how CPU 0
is behaving on a set of servers. Your team has node exporter writing data out to Prometheus
(Use this to simulate https://promlabs.com/promql-cheat-sheet/).
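Assuming node exporter’s default metric names (node_cpu_seconds_total with its cpu and mode labels), a couple of starting-point queries for CPU 0 could look like this; exact label values may differ per deployment:

```promql
# Per-mode usage rate of CPU 0 over the last 5 minutes, on every scraped instance
rate(node_cpu_seconds_total{cpu="0"}[5m])

# Percent busy (non-idle) for CPU 0 on each instance
100 * (1 - rate(node_cpu_seconds_total{cpu="0", mode="idle"}[5m]))
```

The second query works because the idle counter increments one second per second when the core is idle, so its rate is the idle fraction.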
cd /usr/src
wget https://github.com/prometheus/prometheus/releases/download/v3.0.1/prometheus-3.0.1.linux-amd64.tar.gz
tar -xzf prometheus-3.0.1.linux-amd64.tar.gz
cd prometheus-3.0.1.linux-amd64
This week, we were privileged to host a special guest lecture by John Champine, Scott’s brother. Over the course of an engaging two-hour session, John delivered an in-depth exploration of Kubernetes, thoroughly covering the five W’s: Who, What, When, Where, and Why.
John’s passion and deep knowledge of Kubernetes were evident throughout the presentation. Having firsthand experience with the challenges of pre-Kubernetes infrastructure, he offered valuable insights into how this platform has revolutionized modern computing. Notably, John specializes in OpenShift, Red Hat’s (now IBM-owned) management layer built atop Kubernetes. OpenShift adds additional functionality and ease of use to what is already a powerful but complex system.
One concept that particularly caught my attention was the fractionalization of CPU and memory resources made possible by Kubernetes’ sophisticated resource management. John introduced the term millicore, a concept I was previously unfamiliar with: one one-thousandth of a CPU core, the unit Kubernetes uses for fine-grained allocation of processing power, letting fractions of a core be shared across processes during compute cycles. This ability to manage resources at such a granular level is remarkable, showcasing the efficiency and precision of Kubernetes.
Before this lecture, I never imagined that such details—down to the microsecond allocation of core usage—could not only be considered but also controlled and utilized to optimize workloads. This level of resource management truly solidifies Kubernetes’ position as the “operating system of the internet,” enabling applications to run more efficiently and reliably across diverse infrastructures.
John’s insights not only deepened my understanding of Kubernetes but also sparked curiosity about the broader implications of containerized resource management in modern computing.
One idea that has been dispelled by doing these exercises is that Kubernetes is overkill for most things. Yes, standing up multiple networked nodes and having them interoperate is no small task. However, there exist lightweight, single node variants like K3s that enable a simplified experience while maintaining the powerful advantages of an orchestration system.
When the article says Kubernetes is not a PaaS, what do they mean by that? What is a PaaS in comparison?
Kubernetes is not a Platform-as-a-Service (PaaS) because it operates at the container level rather than the application or hardware level. While it provides some features similar to PaaS offerings, Kubernetes emphasizes flexibility, composability, and user choice rather than prescribing a monolithic solution.
You get a ticket about your new test cluster. The team is unable to deploy some of their applications. They suspect there is a problem and send you over this output:
What are you checking on the cluster to validate you see their error?
To identify and validate the issue with the node Test_Cluster2:
Check the overall cluster status
Run kubectl get nodes to see the status of all nodes in the cluster.
This confirms Test_Cluster2 is NotReady.
Inspect node details
Run kubectl describe nodes Test_Cluster2 to check for events, taints, and resource usage issues. Look for errors related to kubelet, network, or node conditions.
Access the node
SSH into the node with ssh Test_Cluster2 to perform further diagnostics.
Verify kubelet status
Run systemctl status kubelet to check if the kubelet service is running and healthy.
Check container runtime
Depending on your runtime, run either systemctl status docker or systemctl status podman to ensure the container runtime is operational.
Reload and restart services
If issues are detected, attempt to reload the systemd daemon (systemctl daemon-reload) and restart kubelet (systemctl restart kubelet).
Potential problems for Test_Cluster2 being NotReady include:
The Test_Cluster2 node is running Kubernetes version v1.29.6+k3s1, which is older than the server and other nodes (v1.30.6+k3s1). This may cause compatibility issues.
The kubelet service might have failed or not started.
Networking problems could prevent the node from communicating with the control plane.
The node may lack sufficient CPU, memory, or disk space for the kubelet to function.
Taints, labels, or configuration errors might prevent the node from becoming Ready.
Hardware issues or kernel module problems may impact the node’s health.
You are the network operations center (NOC) lead. Your team has recently started supporting the dev, test, and QA environments for your company’s K8s cluster. Write up a basic checkout procedure for your new NOC personnel to verify operation of the cluster before escalating on critical alerts.
I do have prior experience with this, having completed the DevSecOps certification from TryHackMe, which covers container security with red-team exercises.
However, it does not cover a ton of defensive measures. This best practices document is going to be really helpful when setting up future containers.
Do this to practice securing those containers. https://killercoda.com/killer-
shell-cks/scenario/static-manual-analysis-docker 👍
Do some of your same checks from before. What do you notice about the pods you created? Did they all
work?
If this breaks in the lab, document the error. Check your disk space and RAM, the two tightest
constraints in the lab. Using systemctl restart k3s and journalctl -xe can you figure out what is failing?
(Rocky boxes may have limitations that cause this to not fully deploy, can you find out why?)
I would like to know more about Kubernetes and how it can be effectively managed, but this is just a matter of further study.
How can you apply this now in your current role in IT? If you’re not in IT, how can you
look to put something like this into your resume or portfolio?
I installed K3S on my personal lab and ran through two days of exercises with the study group, gaining some competency with installing, configuring and deploying a set of pods from manifests. So I could confidently state that I know how to do this and check on the health of the system.
One of the most exciting units for me has been exploring deployment and hosting infrastructure. I’ve already spent some time working with Docker and Podman, but I have had limited hands-on experience with Kubernetes. Before this unit, I completed a few interactive Kubernetes labs on Killercoda, covering basic commands, information gathering, and logging.
This week, I followed an interactive K3s lab that guided me through the installation process—a perfect refresher. Afterward, I jumped onto one of my Proxmox VMs and installed K3s on my homeLab 👨🔧
It’s a slow day in the NOC and you have heard that a new system of container deployments is being used by your developers. Do some reading about containers, Docker, and Podman.
I did not know that Podman and Kubernetes can run WASM applications alongside containers. I have some interest in WASM and its bytecode format, having read about it on occasion.
I did not know that Podman containers can be converted to systemd services.
I learned a technique that I quite like: using podman commit to create a custom image from a modified container. I will likely use this a lot.
Terminology that I wasn’t familiar with:
Control Plane: Manages container orchestration, monitoring, and state across cluster nodes.
The API server: Core interface for communication between users and container clusters.
Scheduler: Assigns containers to nodes based on resource availability and policies.
Declarative with Compose: infrastructure can be explicitly defined and easily rebuilt.
Lightweight / low resource: containers are not complete systems and are stripped to the bare essentials, meaning they are very small images that start fast.
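To make the declarative point concrete, a minimal sketch of a compose file (service name, image, and ports are illustrative assumptions); with podman-compose or docker compose, `up` rebuilds exactly this setup from the file:

```yaml
# Illustrative compose.yaml: one web service, rebuilt identically anywhere
version: "3"
services:
  web:
    image: docker.io/library/httpd:latest
    ports:
      - "8081:80"          # host:container
    volumes:
      - ./site:/usr/local/apache2/htdocs:ro
```

Because the file is the source of truth, it can be versioned in git alongside the rest of the infrastructure.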
You get your first ticket about a problem with containers. One of the engineers is trying to move his container up to the Dev environment shared server. He sends you over this information about the command he’s trying to run.
[developer1@devserver read]$ podman ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
[developer1@devserver read]$ podman images
REPOSITORY TAG IMAGE ID CREATED SIZE
localhost/read_docker latest 2c0728a1f483 5 days ago 68.2 MB
docker.io/library/python 3.13.0-alpine3.19 9edd75ff93ac 3 weeks ago 47.5 MB
[developer1@devserver read]$ podman run -dt -p 8080:80/tcp docker.io/library/httpd
You decide to check out the server
[developer1@devserver read]$ ss -ntulp
Netid State Recv-Q Send-Q Local Address:Port Peer Address:Port Process
udp UNCONN 0 0 127.0.0.53%lo:53 0.0.0.0:* users:(("systemd-resolve",pid=166693,fd=13))
tcp LISTEN 0 80 127.0.0.1:3306 0.0.0.0:* users:(("mariadbd",pid=234918,fd=20))
tcp LISTEN 0 128 0.0.0.0:22 0.0.0.0:* users:(("sshd",pid=166657,fd=3))
tcp LISTEN 0 4096 127.0.0.53%lo:53 0.0.0.0:* users:(("systemd-resolve",pid=166693,fd=14))
tcp LISTEN 0 4096 *:8080 *:* users:(("node_exporter",pid=662,fd=3))
There is a process called node_exporter listening on port 8080, so the port is already in use. I think this is a pretty common issue, as 8080 is a popular alternate HTTP port; with many services running it is easy to double-assign a port.
It is true that once an image is pulled, it is stored locally, so the developer may have pulled the image. However, in the command he is specifying a remote source to pull a fresh image from, so the dev is definitely sus. Typically, once an image has been pulled it is given an image ID, which is then used to run or build with.
Creating a persistent volume and attaching to it worked well.
mkdir /root/TreasuresVolume
Pulling and building an image worked well.
Attaching to the container and interacting with it went well.
podman run -dit --name TreasuresContainer -v /root/TreasuresVolume:/app docker.io/library/golang:alpine tail -f /dev/null
apk add vim gcc bash
I ran into trouble when exiting the container. It will kill the container, forcing me to start it again, which was frustrating.
Having done this in the past, I thought this would not be an issue. However, I just learned that containers without a continually running process will die when exited.
The fix for such an issue is to run something light and persistent in the container to keep it alive. This can be accomplished with a bash loop, a command like tail -f /dev/null, or a long-running Go binary as the container’s main process.
Actually, this blog is used to partially keep track of this. I use LogSeq as a second brain, where I will definitely copy and paste this info into my Podman section.
I would like to know more about small scale deployments of Kubernetes such as the one I deployed using Rancher and K3S. More specifically I would like to know what limitations a small scale deployment has versus a multi machine / multi node system.
This content is highly applicable to my intended work as deploying services is becoming increasingly important, as evidenced by the requirement for knowledge of Podman in the RHCSA 9. I already have some container deployment work to display to prospective employers atm. I plan to further my knowledge in this area, especially in regard to Kubernetes as this seems to be the operating system of the internet.
Once again, beyond the discussion posts and labbing, I spent a great deal of time scripting/programming system checks. After completing the labs on Bash scripting and intro to ‘C’, I got really into Go as a system util language. I had a particularly productive day using Go’s embed.FS feature, packing Unix system tools together into a single Go program at compilation. I think there is a ton of potential here for my own uses. 👨🔧
It’s a 2-week holiday in your country, and most of the engineers and architects who designed the system are out of town. You’ve noticed a pattern of logs filling up on a set of web servers due to increased traffic. Research and verification show that the logs are being sent off in real time to Splunk. Your team has been deleting the logs every few days, but a 3rd-shift engineer missed this in the notes, causing downtime. How might you implement a simple fix to stop-gap the problem until all engineering resources return next week?
Adding more space to /var/log might be a design fix, but it isn’t feasible in the short term due to:
Operational Constraints: Extending storage may involve downtime, additional permissions, or architectural changes that can’t be approved without the primary engineers.
Temporary Nature of the Fix: Increasing space only delays the issue. If logs continue to grow, the problem will recur once space is exhausted again.
Possibility of Log Loss: Higher logrotate frequency could still miss high-frequency log spikes, especially during unusual traffic peaks, risking logs being deleted before Splunk ingestion is complete.
Configuration and Testing: Aggressive logrotate adjustments may interfere with processes expecting logs at specific retention periods. Testing changes in production without key team members isn’t ideal.
To address the issue, consider implementing a temporary fix by configuring a log retention policy that aggressively compresses or truncates logs without disrupting active processes. Here are some potential approaches:
Compress rotated or archived copies of the logs if additional space savings are needed. Tools like gzip can compress logs efficiently, reducing disk space usage while keeping the logs accessible if required for audits or incident investigations.
As a short-term measure, you could set up a RAM disk for logs that don’t need long-term retention. This allows logs to be stored temporarily in memory, reducing disk space pressure. For instance
mount -t tmpfs -o size=512M tmpfs /var/log/temp
You could then configure lower-priority logs to write here temporarily, knowing they will be lost upon reboot, which may be acceptable in a crisis scenario.
If possible, configure the Splunk forwarder to filter logs more aggressively, reducing the volume of logs that are retained on the system. The props.conf or inputs.conf files can be configured to forward logs without keeping local copies.
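A minimal stop-gap sketch combining the truncate-and-compress ideas above. The log path and size threshold are assumptions; the key detail is using truncate rather than rm, so the writing process keeps its open file handle:

```shell
#!/bin/sh
# Stop-gap: archive then empty a busy log once it crosses a size threshold.
# Splunk already receives the events in real time; the local copy is kept
# briefly only as a safety net. Intended to be run from cron.
rotate_log() {
  log="$1"; threshold_kb="$2"
  size_kb=$(du -k "$log" | awk '{print $1}')
  if [ "$size_kb" -gt "$threshold_kb" ]; then
    ts=$(date +%Y%m%d%H%M%S)
    cp "$log" "$log.$ts"     # keep a copy in case Splunk ingestion lags
    gzip "$log.$ts"          # compress the archived copy
    truncate -s 0 "$log"     # empty the live file; writers keep their handle
  fi
}
```

A one-line crontab entry calling this every few hours would cover the 3rd-shift gap until the engineers return.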
You are the only Linux Administrator at a small healthcare company. The engineer/admin before you left you a lot of scripts to untangle. This is one of our many tasks as administrators, so you set out to accomplish it. You start to notice that he only ever uses nested if statements in bash. You also notice that every loop is a conditional while true and then he breaks the loop after a decision test each loop. You know his stuff works, but you think it could be more easily written for supportability, for you and future admins. You decide to write up some notes by reading some google, AI, and talking to your peers.
Nested if statements are useful for situations where each condition depends on the result of the previous test, requiring a hierarchy or sequence.
A case statement is ideal for handling multiple discrete values of a variable, especially if there are many possible branches. It’s typically cleaner and more readable than a nested if.
Use conditional loops (while) when you don’t know the number of iterations in advance and need to loop based on conditions.
Use counting loops (for) when you have a set number of iterations or are working with a list. This structure is clearer and prevents issues that may arise from unintentional infinite loops.
Here is how I would go about optimizing or refactoring the Bash scripts the engineer left me:
I would replace nested if statements with case statements when possible to improve readability, especially when handling multiple discrete values.
Of course, I would comment things for added communication and maintainability.
I would limit while true loops to cases where no predictable count or list is available, clearly defining a break condition early to avoid infinite loops.
I would use for loops for counting or iterating over arrays or lists, as they provide a clean structure with known iteration limits.
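A small before/after sketch of these refactors (the function names and values are hypothetical, not taken from the inherited scripts): the same decision written first as nested ifs in the old style, then flattened into a case statement, plus a counted for loop replacing a while-true-and-break pattern.

```shell
#!/bin/bash
classify_nested() {            # the inherited nested-if style
  if [ "$1" = "start" ]; then
    echo "starting"
  else
    if [ "$1" = "stop" ]; then
      echo "stopping"
    else
      echo "unknown"
    fi
  fi
}

classify_case() {              # the refactor: flat, readable, easy to extend
  case "$1" in
    start) echo "starting" ;;
    stop)  echo "stopping" ;;
    *)     echo "unknown"  ;;
  esac
}

count_for() {                  # known iteration count: for, not "while true"
  for i in 1 2 3; do
    echo "pass $i"
  done
}
```

Both classify functions behave identically, which is the point: the refactor changes supportability, not behavior.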
As the course progresses, I am learning more deeply within a study group. Aside from the lab work and discussion posts, I have been putting in a lot of hours satisfying curiosities about the Linux system.
For this unit we did a deep dive into packaging, going so far as to look at the history and the reasoning behind design decisions. We also looked at how packages are managed within tightly controlled environments.
I now feel that I have a robust understanding of the theory and practical elements of Red Hat packaging and beyond.
Versioning enables you to monitor software updates systematically, making it easier to troubleshoot, roll back changes, and trace modifications for security or functionality purposes. 👍
One can manage dependencies confidently, avoiding conflicts between system components, libraries, and tools. This is crucial for stable and consistent deployments. 👍
We can verify package integrity, ensuring installed software hasn’t been altered or corrupted. This is essential for maintaining a secure and stable system environment. 👍
You are new to a Linux team. A ticket has come in from an application team and has already been escalated to your manager. They want software installed on one of their servers but you cannot find any documentation and your security team is out to lunch and not responding. You remember from some early documentation that you read that all the software in the internal repos you currently have are approved for deployment on servers. You want to also verify by checking other servers that this software exists. This is an urgent ask and your manager is hovering.
How can you check all the repos on your system to see which are active? 🤔
How would you check another server to see if the software was installed there? 🤔
If you find the software, how might you figure out when it was installed? (Time/Date) 🤔
In an urgent situation like this, I’d first check which approved software repositories are active on my system,
then verify if the software is already installed on similar servers to ensure it’s safe to proceed. Finally, I’d review the installation history to confirm when it was
added. Red Hat packaging and package management systems offer many more options than I was expecting; through labbing in the study group,
I’ve gained a much better understanding of packages, dependencies, and package management.
Packaging was a pretty deep rabbit hole for me.
This is the process I’d follow for this case:
Check Active Repositories
dnf repolist
Check if the Software is Installed on Your System
rpm -qa | grep <software_name>
or
dnf list installed <software_name>
Check Another Server for Software Installation with SSH and Step 2 commands.
View Installation History (Time/Date)
dnf history list <software_name>
dnf history info <transaction_id>
(List the package’s transactions first to find the transaction ID, then inspect that transaction for the exact time and date.)
(After you have completed the lab) - Looking at the concept of group install from DNF or Yum. Why do you think an administrator may never want to use that in a running system? Why might an engineer want to or not want to use that? This is a thought exercise, so it’s not a “right or wrong” answer it’s for you to think about.
Software bloat is when essential tools/packages are larger than they need to be, affecting performance, reliability, and security. By performance I am referring to the loss of potential performance from unnecessary resource use. In regards to reliability, more complex systems inherently have more potential to fail. Security means many things, so I am specifically thinking about attack surface and the potential for vulnerability due to the aforementioned complexity.
By targeting specific packages, tracking changes, and reducing unnecessary dependencies and bloat, we satisfy the tenets of a security baseline: establishing consistency, simplifying compliance, enhancing efficiency, and reducing risk.
In this unit, we explore essential concepts in firewall management using Firewalld and UFW. A firewall acts as a security system, controlling the flow of traffic between networks by enforcing rules based on zones —logical areas with different security policies. Services are predefined sets of ports or protocols that firewalls allow or block, and zones like DMZ (Demilitarized Zone) provide added security layers by isolating public-facing systems. Stateful packet filtering tracks the state of connections, allowing more dynamic rules, while stateless packet filtering inspects individual packets without connection context. Proxies facilitate indirect network connections for security and privacy, while advanced security measures such as Web Application Firewalls (WAF) and Next-Generation Firewalls (NGFW) offer specialized protection against modern threats.
This week, we dove deep into configuring and testing firewall settings in our Discord study group. I had several virtual machines set up in my ProxMox home lab, and we experimented while completing the lab work. As usual, we went on several tangents, verifying ideas. In total, we spent over 5 hours running commands, experimenting with different configurations, breaking things, and debating solutions. By the end of the session, I gained a practical understanding of Firewalld configuration, packet sending, and packet tracing with Wireshark. It was frustrating at times, but ultimately rewarding.
Firewalld uses zones to define the level of trust for network connections, making it easy to apply different security settings to various types of connections (like home, public, or work). It’s dynamic, meaning changes can be made without restarting the firewall, ensuring smooth operation.
The concept is specific to Firewalld. Zones are a predefined set of firewall rules that determine the level of trust assigned to a network connection. Zones allow you to apply different security policies to different network interfaces based on how much you trust the network.
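As a quick sketch of zone management with firewall-cmd (the interface name and zone choice below are placeholders; the commands require root and a running firewalld):

```shell
# List the predefined zones and see which are currently active
firewall-cmd --get-zones
firewall-cmd --get-active-zones

# Assign an interface to the more-trusted "home" zone and allow SSH there
# (eth0 is a placeholder interface name)
sudo firewall-cmd --zone=home --change-interface=eth0 --permanent
sudo firewall-cmd --zone=home --add-service=ssh --permanent

# Apply the permanent changes to the running firewall
sudo firewall-cmd --reload
```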
Uncomplicated Firewall is a user-friendly firewall designed to simplify the process of controlling network traffic by allowing or blocking connections. UFW is commonly used on Ubuntu and provides easy commands for setting up firewall rules, making it ideal for beginners. Despite its simplicity, it is powerful enough to handle complex configurations.
By default, UFW denies all incoming connections while allowing outgoing ones. This enhances security by requiring users to explicitly allow any incoming traffic.
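A minimal UFW setup reflecting that default-deny posture might look like this (the allowed ports are examples, adjust for your services; requires root):

```shell
# Default policy: deny inbound, allow outbound
sudo ufw default deny incoming
sudo ufw default allow outgoing

# Explicitly allow only the inbound traffic you actually need
sudo ufw allow 22/tcp    # SSH
sudo ufw allow 443/tcp   # HTTPS

# Turn it on and review the rules
sudo ufw enable
sudo ufw status verbose
```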
Web Application Firewall is a security system designed to protect web applications by filtering and monitoring HTTP traffic between a web application and the internet. It helps prevent common web-based attacks like SQL injection, cross-site scripting (XSS), and cross-site request forgery (CSRF) by analyzing the incoming and outgoing traffic and blocking malicious requests. Unlike traditional firewalls that focus on network security, a WAF specifically targets the security of web applications and can be an important part of a layered defense strategy.
WAFs are generally more sophisticated than Firewalld or UFW because they operate at the application layer (Layer 7) of the OSI model. Blocking traffic is one thing, but packet inspection is another.
Next-Generation Firewall is an advanced type of firewall that goes beyond traditional firewall features like packet filtering. It combines standard firewall capabilities with more advanced functionalities such as deep packet inspection (DPI), intrusion prevention systems (IPS), and application-level control. NGFWs can inspect and control traffic at a more granular level, allowing administrators to set security rules based on specific applications, users, or behaviors.
Deep Packet Inspection (DPI): Examines the content of data packets, not just their headers, allowing the firewall to identify and block threats hidden in the traffic.
Intrusion Detection and Prevention System (IDS/IPS): Monitors network traffic for suspicious activity and can take action (like blocking or alerting) to prevent attacks in real-time.
Application Awareness and Control: Recognizes and manages specific applications (e.g., Facebook, Skype) regardless of port or protocol, allowing for fine-grained traffic control.
Advanced Malware Protection (AMP): Detects and blocks malware using both signature-based detection and behavioral analysis to prevent malware from entering the network.
SSL/TLS Decryption: Decrypts encrypted traffic (HTTPS) for inspection to detect threats hiding inside secure channels.
User Identity Integration: Applies firewall rules based on user identity or group membership rather than just IP addresses, providing more flexible access control.
Threat Intelligence Feeds: Uses real-time threat data from global databases to protect against emerging threats and malicious IP addresses or domains.
Cloud-Delivered Security: Provides scalable and flexible cloud-based protection services such as sandboxing, traffic analysis, and updates for zero-day attacks.
Virtual Private Network (VPN) Support: Allows secure, encrypted connections for remote users or between different networks (site-to-site or remote access VPNs).
URL Filtering: Controls access to websites based on categories (e.g., social media, gambling) or specific URLs, helping enforce acceptable use policies.
Quality of Service (QoS): Prioritizes certain types of traffic, ensuring that critical applications receive the necessary bandwidth and reducing congestion.
Zero-Trust Network Segmentation: Implements policies based on strict access control, ensuring that users and devices only access the resources they are explicitly permitted.
Sandboxing: Isolates suspicious files or code in a secure environment to detect malicious behavior without affecting the rest of the network.
Logging and Reporting: Provides detailed logs and reports on network traffic, blocked threats, and firewall activity for auditing and troubleshooting.
I am familiar with pfSense, as it is open source and popular among homelab enthusiasts because it offers expansive features built upon FreeBSD, which has killer networking.
Commercial NGFWs (e.g., Palo Alto Networks, Cisco Firepower) often come with built-in advanced features such as cloud-delivered threat intelligence, AI-powered threat detection, and sandboxing for zero-day threats. While pfSense can be extended with third-party packages, it doesn’t natively offer the same level of seamless integration or automation.
Commercial NGFWs typically provide a centralized management console for handling multiple firewalls across large networks. While pfSense can handle multiple installations, managing them requires more manual effort and may not be as streamlined as the enterprise-grade management consoles of commercial NGFWs.
pfSense relies on community and third-party support, whereas commercial NGFWs offer direct vendor support with service level agreements (SLAs), which can be crucial for large enterprises needing guaranteed response times and assistance.
NGFWs like those from Palo Alto or Cisco often integrate with real-time global threat intelligence networks, offering constant updates about emerging threats. While pfSense can be configured with tools like Snort for intrusion detection, it lacks the built-in, cloud-powered intelligence found in commercial NGFWs.
A ticket has come in from an application team. Some of the servers your team built for them last week have not been reporting up to enterprise monitoring, and they need it to be able to troubleshoot a current issue, but they have no data. You jump on the new servers and find that your engineer built everything correctly and installed the agents for node_exporter, ceph_exporter, and the logstash exporter that your teams use. However, they also adhered to the new company standard that firewalld must be running. No one has documented the ports that need to be open, so you’re stuck between the new standards and fixing this problem on live systems.
-t: Show TCP sockets.
-u: Show UDP sockets.
-l: Show listening sockets, i.e., those waiting for incoming connections.
-n: Show the output numerically, without resolving service names (e.g., display IP addresses and port numbers instead of domain names or service names like “http”).
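Putting those flags together for the exporter scenario, a sketch of the workflow might be as follows. The port numbers here are assumptions: 9100 is node_exporter’s usual default, but the ceph and logstash exporter ports vary by packaging, so verify what ss actually reports on your hosts before opening anything.

```shell
# Find what the monitoring agents are actually listening on
# (-p adds the owning process name; requires root to see all processes)
sudo ss -tulnp | grep -Ei 'exporter|logstash'

# Open the discovered ports through firewalld and make it persistent
sudo firewall-cmd --add-port=9100/tcp --permanent   # node_exporter default (verify)
sudo firewall-cmd --reload

# Confirm the rule took effect
firewall-cmd --list-ports
```

This threads the needle: firewalld stays running per the new standard, and the exporters can report in, with the opened ports now documented.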
Basically all of the concepts used are new to me. I am not very well versed in networking, network scanning or inspecting service configs.
So this became a research and practice exercise that has shown me quite a lot of new tricks.
A manager heard you were the one that saved the new application by fixing the firewall. They get your manager to approach you with a request to review some documentation from a vendor that is pushing them hard to run a WAF in front of their web application. You are “the firewall” guy now, and they’re asking you to give them a review of the differences between the firewalls you set up (which they think should be enough to protect them) and what a WAF is doing.
This report has been prepared in response to a request to evaluate the suitability of implementing a Web Application Firewall (WAF) within our infrastructure. The aim of this report is to:
Compare WAF technology with traditional firewall solutions currently implemented.
Without WAF: Increased vulnerability to web application-specific threats, such as cross-site scripting (XSS) and SQL injection, especially for critical applications.
With WAF: Increased security for web applications but requires ongoing monitoring and adjustment to ensure performance and efficacy.
Based on the evaluation, I recommend the following:
Implement a WAF: Due to the increasing reliance on web applications and the rise in web-based attacks, implementing a WAF would provide an essential layer of security.
Maintain Traditional Firewalls: Existing firewalls should continue to be used for network-level protection alongside the WAF.
Pilot Implementation: Begin with a limited deployment of WAF for high-risk applications to evaluate performance and cost before a full-scale rollout.
Staff Training: Ensure the security and IT teams are trained in WAF management to maximize its effectiveness.
The implementation of a Web Application Firewall is a strategic move to protect our web applications from evolving security threats. While traditional firewalls remain crucial for network security, they cannot defend against the types of attacks WAFs are designed to mitigate. By implementing both WAF and traditional firewall solutions, we can ensure comprehensive security coverage across both network and application layers.
Firewall: A security device that monitors and controls incoming and outgoing network traffic based on predetermined security rules.
Zone: A defined area within a network that contains systems with similar security requirements, separated by a firewall.
Service: A specific type of network functionality, like HTTP or DNS, that can be allowed or blocked by a firewall.
DMZ: A “Demilitarized Zone” is a network segment that serves as a buffer between a secure internal network and untrusted external networks.
Proxy: A server that acts as an intermediary for requests between clients and servers, often used for filtering, security, or caching.
Stateful packet filtering: A firewall feature that tracks the state of active connections and makes filtering decisions based on the connection’s state.
Stateless packet filtering: A type of firewall filtering that analyzes each packet independently without considering the state of the connection.
WAF: A Web Application Firewall that protects web applications by filtering and monitoring HTTP/HTTPS traffic for threats like SQL injection and XSS.
NGFW: A Next-Generation Firewall that combines traditional firewall functions with additional features like application awareness, integrated intrusion prevention, and advanced threat detection.
The overarching theme of this unit is in the title: we are looking at Managing Users & Groups. Managing users and groups in Linux within an enterprise involves creating, modifying, and organizing user accounts and permissions to enforce security and control over resources.
Organizing permissions to enforce security is more important than it has ever been, as we live in a hyper-connected world with many bad actors and large amounts of sensitive data.
Linux is fundamentally well suited for managing users and groups because permissions permeate every aspect of a Linux environment. Everything is a file and every file has associated permissions; therefore, as administrators we have granular control over the comings and goings of users.
Looking at /etc files relating to users, groups, and associated security
/etc/passwd
contains essential information about users, including their username, user ID (UID), group ID (GID), home directory, and default shell, with each entry separated by a colon.
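For example, each colon-separated entry can be pulled apart with awk. Using the root account, which exists on any Linux system:

```shell
# Show root's entry: name:password:UID:GID:comment:home:shell
getent passwd root

# Field 7 is the default shell
getent passwd root | awk -F: '{print $7}'
```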
/etc/group
stores group information, listing each group’s name, group ID (GID), and its members, with each entry separated by a colon, allowing users to belong to one or more groups for access control purposes.
/etc/shadow
contains encrypted password information and related security details for user accounts, such as password aging and expiration
/etc/gshadow
stores encrypted passwords for group accounts, as well as information about group administrators and members, providing enhanced security for group access by restricting who can modify or access specific group data.
/etc/login.defs
configuration settings for user account creation and login parameters, such as password aging policies, UID and GID ranges, and the default paths for user home directories, helping to control system-wide authentication behavior.
/etc/skel/
provides template files that are automatically copied to a new user’s home directory when the user is created, ensuring they have default configuration settings.
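A quick way to see this in action (the username is a placeholder, and useradd requires root):

```shell
# The template dotfiles waiting to be copied
ls -A /etc/skel/

# Creating a user with -m copies them into the new home directory
sudo useradd -m demo_user
ls -A /home/demo_user/
```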
/etc/fstab
This file contains information about disk partitions and other block storage devices and how they should be automatically mounted during the boot process.
Permissions: Usually -rw-r--r-- (readable by all users, writable only by root).
/etc/hostname
This file stores the system’s hostname, which is a unique identifier for the machine in a network.
Permissions: Usually -rw-r--r-- (readable by all users, writable only by root).
/proc
This is a virtual filesystem that provides detailed information about processes and system resources. It does not contain actual files but rather system and process information in real-time.
Permissions: dr-xr-xr-x (readable and executable by all users, writable only by root).
/boot
Contains the kernel, initial ramdisk, and bootloader files needed to start the system.
Permissions: drwxr-xr-x (readable and executable by all users, writable only by root).
/root
This is the home directory for the root user (the system administrator).
Permissions: drwx------ (only root has read, write, and execute permissions).
/usr/bin
Contains binary executables for user programs.
Permissions: drwxr-xr-x (readable and executable by all users, writable by root).
Het’s server is unique to me. He uses an ingest system that makes a jump to the actual server for security purposes. Within the main server we have a Warewulf-managed cluster running a series of Rocky Linux VMs. ⛰
Since we will be doing this from one of the Rocky nodes within the system, the jump server will not be an issue, I reckon. 🤔
So I shouldn’t just pop into the server and go willy-nilly with scanning commands. This server is managed by someone who understands security, so it is best to do some Dead Reckoning beforehand.
Mapping the remote servers open ports with nmap.
nmap stands for Network Mapper. A quick perusal of the man page states it is an exploration tool and security / port scanner. It was designed to rapidly scan large networks. Nmap uses raw IP packets in novel ways to determine what hosts
are available on the network, what services (application name and version) those hosts are offering, what operating systems (and OS versions) they are running, what type of packet filters/firewalls are in use, and dozens of other characteristics.
In order to map the network, I am going to use the IP address as the argument/target of the nmap command.
nmap <server-ip>
Checking mounted filesystems with df. df stands for disk free. It is a utility that displays statistics about the amount of free disk space on the specified mounted file system, or on the file system of which a given file is a part.
I will be using the -h flag for human-readable format, which adds unit suffixes, e.g. gibibytes (GiB), to the ends of values. There is more to it, but I am mentioning the bread and butter of the flag.
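In practice that looks like:

```shell
# Human-readable free space for all mounted filesystems
df -h

# Or scoped to the filesystem backing a single path
df -h /
```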
nmap is actually a big command, by big I mean the number of options and capabilities are vast. It is quite popular with pen testers and is packaged with Kali Linux – a security analysis and exploitation focused distribution of Linux, so best believe it is something important. This essentially means I should take some time to get to know it more.
Beware that nmap can and will trigger detection software like an active firewall, because nmap is conducting Funny Bizniz by way of packet trickery inside a network, both are technical terms. Luckily there is a stealth option (-sS) that enables the mapping to take place undetected, for the most part.
Contrary to what I assumed, the lower case s does not even stand for stealth, though it still helps as a mnemonic. No, it actually stands for SYN and
SYN stands for Synchronize. It is part of the TCP three-way handshake, which is a process used to establish a reliable connection between two devices on a network and has nothing to do with guys meeting at a bar.
I thought we were trying to be stealthy, not synchronized 🤔
Well, it is actually a form of Funny Bizniz wherein a SYN is sent and the handshake is never completed, thus hiding the activity somehow
but how? Well… it reduces the chance of being logged by the target system’s monitoring tools, such as firewalls or intrusion detection systems. Now we both know.
Keep in mind this is only one command option, just imagine how deep the rabbit hole goes.
The network is safely tucked behind a Dynamic Domain Name System (DDNS) running on an Asus Router. This setup allows access from a WAN, in this case, The Internet (if you’ve heard of it 🌐). It securely gate-keeps the network via login credentials, because let’s be honest, the Internet can be a scary place with bots… and sometimes people 👀.
So, what exactly does DDNS do? Well, I’m glad you asked! Most home internet connections don’t offer a static IP address, which makes hosting things tricky because the IP will randomly change. The IP is dynamic for a few reasons, including cost savings and the limited availability of IPv4 addresses. Anyway, I’m getting off track 🛤️. A DDNS monitors this dynamic IP and links it to a stable address that stays fixed.
TL;DR: DDNS bonds a dynamic IP with a fixed address, offering the added bonus of hiding the internal IP from bad actors, or as I like to call them, the baddies in London 🕵️♂️. So, this network has a DDNS gateway in place for that extra layer of security 🔒.
Once a credentialed fellow enters their login while hanging at the gateway, a list of servers appears, like a digital menu 🍽️. From this list, one can choose where to jump to. In my case, I leapt to Rocky12, a node within a managed cluster. I knew a little about this network thanks to earlier sessions, but most of this can also be discovered by doing some network scanning 🕵️.
This is where the magic of nmap comes into play! I used two scan types: nmap -sS (a stealthy SYN scan) and nmap -sT (a full TCP connect scan). The -sT option scanned a wide range of ports and connections, giving me a detailed list that I piped into less for easier viewing 📜.
I grepped all the IP addresses and ran an -sS SYN Scan to gather more details on each node. These scans revealed loads of information about open ports and, in some cases, the hostnames of devices 🎯.
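A sketch of that pipeline looks like the following (the subnet is a placeholder for the real network range, and -sS requires root):

```shell
# Ping sweep to discover live hosts, then pull out just the IPs
nmap -sn 192.168.1.0/24 | grep -oE '([0-9]{1,3}\.){3}[0-9]{1,3}' > hosts.txt

# SYN-scan each discovered host and page through the results
sudo nmap -sS -iL hosts.txt | less
```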
I quickly identified the Warewulf orchestration device and all of the Rocky nodes. However, a few mysterious devices needed some detective work 🔎. I noticed glrpc listed as a service while scanning six addresses—three on one IP range and three on another. A quick Google search revealed that glrpc relates to GlusterFS, a distributed filesystem associated with Red Hat systems. After watching a video explainer on GlusterFS, I figured out that these two IP ranges were likely a RAID or high availability configuration 💾.
I initially created a highly technical Engineer’s map filled with data and presented it to Het for feedback. He advised me to think about how management would interpret it 🤔. So, back to the drawing board! I focused on improving the visual presentation and labeling, keeping in mind that management doesn’t care about the nitty-gritty; they just need a clear, high-level understanding during briefings 📊.
This concludes my exercise in mapping an unknown network! I learned a lot from this experience and am quite proud of the outcome 💪. It will undoubtedly come in handy when I find myself in future scenarios with many unknowns.
Lateral movement in the context of cyber exploitation refers to an attacker’s strategy of moving across a network to gain access to additional systems or sensitive data after initially infiltrating a single point. This involves leveraging compromised credentials, escalating privileges, or exploiting vulnerabilities to navigate between hosts and systems. The objective is often to broaden access within the environment while avoiding detection, eventually targeting critical infrastructure or data.
Exfiltration 🧳🚀
refers to the unauthorized transfer of data from a target system or network to an external location controlled by an attacker. This can involve methods such as encrypted tunnels, covert channels, or compromised accounts to avoid detection. Exfiltration is typically the final step in a cyberattack, allowing attackers to steal sensitive data, intellectual property, or credentials for further malicious activities.
The impacts of data exfiltration can be massive. Depending on the severity, an exfiltration incident can entirely destroy an organization or cause financial losses, reputational damage, labor force diversion, data loss, and jeopardized future security.
The Linux user environment is a customizable space that includes settings like environment variables, shell configurations, and startup scripts, typically defined in files such as .bashrc or .profile. These configurations allow users to tailor command line behavior, automate tasks, and create a personalized and efficient working environment.
Customizations for the user environment in Linux might include setting up aliases for frequently used commands, configuring environment variables, and adding functions to streamline workflows. These changes can make repetitive tasks faster, improve the command line interface’s convenience, and adapt the environment to personal preferences.
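A few illustrative additions to ~/.bashrc (the names here are my own examples, not anything standard):

```shell
# Aliases for frequently used commands
alias ll='ls -lah'
alias gs='git status'

# Environment variables
export EDITOR=vim

# A small helper function: make a directory and cd into it in one step
mkcd() { mkdir -p "$1" && cd "$1"; }
```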
Problems around helping users with their dot files often stem from the diverse and sometimes incompatible changes users make to suit their needs. This can lead to issues like conflicting configurations, inconsistent behavior across systems, or difficult debugging when unexpected behaviors arise from custom scripts.
Footprinting is an essential phase in ethical hacking and system security. It involves gathering information about a computer system, network, or organization to understand its structure and identify potential vulnerabilities. Footprinting is often the first step of a cyberattack or penetration test, allowing attackers or security professionals to map out an environment before deciding how to approach the next steps.
Scanning is a process of actively probing systems or networks to identify open ports, services, and vulnerabilities. It’s used by attackers to gather deeper insights for potential exploitation and by security professionals to assess weaknesses. Common tools include Nmap and Nessus, while defenses include firewalls and IDS/IPS systems.
Enumeration is the process of extracting more detailed information about a target, such as usernames, network shares, and system services, after identifying open ports and active systems. It typically involves active engagement with the target to gain in-depth knowledge that can be used for exploitation.
System hacking is the process of gaining unauthorized access to individual systems or networks by exploiting vulnerabilities. It involves activities such as password cracking, privilege escalation, installing backdoors, and covering tracks. Ethical hackers use these techniques to assess system security and recommend protective measures.
Privilege escalation is the process of gaining higher-level permissions or privileges than initially granted, allowing an attacker to execute commands with elevated authority. This can be achieved through exploiting vulnerabilities or misconfigurations, leading to unauthorized access to restricted resources or system control.
The Rule of Least Privilege (LoP) is a security principle that states users, applications, and systems should only be granted the minimum level of access or permissions necessary to perform their tasks. This helps reduce the attack surface, limit potential damage from breaches, and mitigate insider threats.
Covering tracks is the process attackers use to hide their unauthorized activities and avoid detection. This involves techniques such as deleting or modifying system logs, using rootkits, and clearing command histories to prevent system administrators or security teams from discovering their presence or actions.
Planting backdoors involves installing hidden access points in a system, allowing attackers to bypass regular authentication and gain unauthorized access at a later time. Backdoors can be inserted through malicious code, vulnerabilities, or modifications to existing software, making them useful for maintaining persistent control over compromised systems.
Recently, I was chatting with fellow enthusiasts in the ProLUG group about creating system utilities with Go.
Between learning about it on my own time and getting input from others, I realized there are so many compelling reasons for using Go as a utility language. 🛠️
Currently, the idea isn’t as prevalent as Python or Bash, but it’s gaining traction. This makes sense since Go was designed with modern, interconnected systems in mind.
A system utility streamlines repetitive tasks by offering custom commands tailored to your workflow, boosting efficiency for system administrators and engineers. Work smarter, not harder. 💡
Go offers a package called Cobra, which enables the creation of command-line interfaces that take input from the terminal and execute predetermined logic. I’m putting this simply in case you’re unfamiliar with how programs work 😆.
The CLI is the simplest form of a computer program and is perfect for Linux system folks.
The core idea of this article is to explain how a Go program can act as a personalized utility that lives on your system or can be deployed across many. Go compiles code into a binary—native computer language. Once built, the binary runs reliably without failure, assuming the program logic is solid. This makes Go ideal for utilities that are called repeatedly to perform simple tasks.
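The build-and-deploy step is as simple as the following (the tool name and install path are placeholders; this assumes the Go toolchain is installed and you are in the project directory):

```shell
# Compile the project in the current directory into a single binary
go build -o mytool .

# Install it onto the PATH so it behaves like any other system utility
sudo install -m 0755 mytool /usr/local/bin/
mytool --help
```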
Managing systems involves a lot of repetitive tasks, and the goal of any sysadmin or engineer is to automate them. Tools like Ansible, Python, and Bash are already well-known for automation. However, there are specific reasons to use Go, which I’ll explore further in a future article titled “Why Go?” where I’ll break down why it belongs in your quiver of tools 🏹.
To complete the Professional Linux User Group (ProLUG) Professional Administrator Course (PAC), we are required to submit a final Capstone Project.
This article is a bit out of order, as I proactively chose a project topic before it was formally introduced. The purpose of this article is to document the project requirements as set out by our instructor.
Of course, I like to make slight modifications to ensure everything is neatly formatted and grouped—with plenty of emoji usage for extra flair! 😉