ProLUG Admin Course Unit 12 🐧

November 30, 2024

Baselining & Benchmarking

The purpose of a baseline is not to find fault or load, or to take corrective action. A baseline simply determines what is. You must know what is so that, when you make a change, you can test against it and objectively say whether there was or wasn’t an improvement. You must know where you are to be able to properly plan where you are going. A poor baseline assessment, because of inflated numbers or inaccurate testing, does a disservice to the rest of your project. You must accurately draw the first line and understand your system’s performance.

Discussion Post 1:

Your manager has come to you with another emergency. He has a meeting next week to discuss capacity planning and usage of the system with IT upper management. He doesn’t want to lose his budget, but he has to prove that the system utilization warrants spending more.

What information can you show your manager from your systems?

You could present your manager with a progressive trend graph showing time on the x-axis and several fields on the y-axis that represent changes from a baseline, assuming the necessary data has been collected. With this information, it would be possible to predict when various system resources will reach their maximum capacity.

What type of data would prove system utilization? (Remember the big 4: compute, memory, disk, networking)

Compute: CPU load, process execution time, throughput.
Memory: RAM utilization, memory paging/swapping rates.
Disk: disk operations (IOPS).
Networking: requests and bandwidth.

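
As one illustration of turning a raw counter into a utilization figure, here is a minimal sketch that computes memory usage from text in the format of /proc/meminfo. The function names and the sample values are my own assumptions for illustration; on a real system you would read the file itself instead of the SAMPLE string.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field_name: value_in_kB}."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            values[name.strip()] = int(parts[0])
    return values


def memory_used_percent(meminfo):
    """Percent of RAM in use, treating MemAvailable as the free pool."""
    total = meminfo["MemTotal"]
    available = meminfo["MemAvailable"]
    return 100.0 * (total - available) / total


# Hypothetical sample; on a live host use open("/proc/meminfo").read()
SAMPLE = """MemTotal:       16000000 kB
MemFree:         2000000 kB
MemAvailable:    4000000 kB
"""

print(f"Memory used: {memory_used_percent(parse_meminfo(SAMPLE)):.1f}%")
```

Logging a number like this on a schedule is what makes the trend graph possible later.
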
What would your report look like to your manager?

Capacity Planning Report

Current and projected system utilization. By examining trends over time, we can predict when critical resources will reach their limits if no additional capacity is provisioned.

Key Areas of Focus

Compute Usage
Memory Load
Disk Resources
Networking Metrics

Historical Data and Trends

Below is a sample progressive trend graph over the last 6 months. The x-axis represents time (in weeks), while the y-axis shows percentage utilization relative to an established baseline.

Example Metrics (relative to baseline):

CPU Utilization (% of baseline)
Memory Load (% of baseline)
Disk IOPS (% of baseline)
Network Throughput (% of baseline)
Time (Weeks):    1   2   3   4   5   6  …  20  21  22
CPU Util (%):   50  52  55  57  60  62  …  80  82  85
Memory (%):     45  47  50  50  52  55  …  70  73  75
Disk IOPS (%):  30  32  35  36  38  40  …  60  63  68
Network (%):    40  42  45  47  49  51  …  75  78  80

As time progresses, each of the key metrics is trending upward, indicating increasing load and approaching capacity thresholds.
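
To put a number on "trending upward", a least-squares line can be fitted to each metric. The sketch below uses only the sample points visible in the table (weeks 1 to 6 and 20 to 22; the elided weeks are left out rather than guessed). The function name linear_fit and the dictionary keys are my own for illustration.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Only the weeks shown in the sample table; elided weeks are omitted.
WEEKS = [1, 2, 3, 4, 5, 6, 20, 21, 22]
METRICS = {
    "cpu": [50, 52, 55, 57, 60, 62, 80, 82, 85],
    "memory": [45, 47, 50, 50, 52, 55, 70, 73, 75],
    "disk_iops": [30, 32, 35, 36, 38, 40, 60, 63, 68],
    "network": [40, 42, 45, 47, 49, 51, 75, 78, 80],
}

for name, values in METRICS.items():
    slope, intercept = linear_fit(WEEKS, values)
    print(f"{name}: {slope:+.2f} points per week (intercept {intercept:.1f})")
```

A positive slope for every metric backs up the statement above with a rate rather than an impression.
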

Projections

We can estimate the “time to ceiling” for critical resources. For instance, if CPU load is rising at an average rate of 2–3% per month and we know that the system experiences performance degradation at 90% utilization, dividing the remaining headroom by that growth rate gives the number of months left.

Projected Time to CPU Ceiling: 3–5 months
Projected Time to Memory Ceiling: 6–8 months
Projected Time to Disk IOPS Ceiling: 8–10 months
Projected Time to Network Bandwidth Ceiling: 4–6 months
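
The arithmetic behind a projection like these is headroom divided by growth rate. Here is a small sketch of it; the 2–3% per month growth and the 90% ceiling come from the text above, while the current utilization of 80% is an assumption I picked for illustration.

```python
def months_to_ceiling(current_pct, ceiling_pct, growth_pct_per_month):
    """Months until a linearly growing metric reaches its ceiling."""
    if growth_pct_per_month <= 0:
        raise ValueError("growth rate must be positive")
    headroom = max(ceiling_pct - current_pct, 0.0)
    return headroom / growth_pct_per_month


CURRENT_CPU = 80.0  # assumption: current CPU utilization in percent
CEILING = 90.0      # degradation threshold from the report

fastest = months_to_ceiling(CURRENT_CPU, CEILING, 3.0)
slowest = months_to_ceiling(CURRENT_CPU, CEILING, 2.0)
print(f"CPU ceiling in roughly {fastest:.1f} to {slowest:.1f} months")
```

With those assumed inputs the result is about 3.3 to 5 months, in line with the CPU projection listed above. This assumes linear growth; a resource growing faster than linearly would hit the ceiling sooner.
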

Recommendations

Compute: Consider adding more CPU cores or upgrading processors before reaching the predicted 90% utilization mark.
Memory: Upgrade RAM or optimize applications to reduce memory footprint.
Disk: Enhance disk subsystems or switch to faster storage (e.g., SSDs) to handle projected IOPS.
Networking: Increase network capacity (e.g., from 1Gb to 10Gb links) or optimize network traffic.

Conclusion

Investment in scaling resources now will prevent future performance bottlenecks, ensuring the system can continue to meet business demands effectively.

Discussion Post 2:

You are in a capacity planning meeting with a few of the architects. They have decided to add 2 more agents to your Linux systems: a Bacula agent and an Avamar agent. They expect these agents to run their work starting at 0400 every morning.

What do these agents do? (May have to look them up)

Bacula is an open-source suite of tools designed to automate backup tasks. It’s widely regarded for its flexibility and reliability. Dell Avamar, on the other hand, is a commercial backup automation solution. Both tools handle incremental backups using custom daemons that monitor changes over time, offering greater sophistication than simple scheduling systems like Cron. Additionally, they can manage backups across diverse, heterogeneous storage environments.

Do you think there is a good reason not to use these agents at this timeframe?

This approach is about balancing workload. If all processes start at a fixed time, they can consume valuable resources simultaneously. The best schedule depends on the environment. For example, if the environment experiences downtime—such as a traditional office setting—starting backups at 4 a.m. might be fine. However, if services run around the clock, it’s better to stagger the tasks so they use only a fraction of the available resources at any given time. This approach also reduces the impact of failures, since not all systems are involved at once.
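
A rough sketch of what staggering could look like: spread the hosts evenly across a backup window instead of starting them all at 0400. The host names and the 60-minute window are hypothetical; the resulting times would still have to be entered into whatever scheduler the backup product uses.

```python
from datetime import datetime, timedelta


def staggered_starts(hosts, window_start="04:00", window_minutes=60):
    """Assign each host an evenly spaced start time inside the window."""
    base = datetime.strptime(window_start, "%H:%M")
    step = window_minutes / max(len(hosts), 1)
    return {
        host: (base + timedelta(minutes=i * step)).strftime("%H:%M")
        for i, host in enumerate(hosts)
    }


# Hypothetical host names, for illustration only
HOSTS = ["web01", "web02", "db01", "db02"]

for host, start in staggered_starts(HOSTS).items():
    print(f"{host}: backup starts at {start}")
```
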

Is there anything else you might want to point out to these architects about these agents they are installing?

There are several factors architects should consider. However, in the context of this discussion, performance overhead is particularly relevant. They need to ensure that the chosen backup solutions won’t overburden the system’s resources and that there’s enough “breathing room” to maintain smooth operations.

Discussion Post 3: ‘TODO’

Your team has recently tested a proof of concept of a new storage system. The vendor has published the blazing fast speeds that this storage system is capable of. You have a set of systems connected to both the old storage system and the new storage system.

Write up a test procedure of how you may test these two systems.

I did a bit of research regarding tooling for such a task and found fio (the “Flexible I/O Tester”), a program written for the purpose of testing storage under various scenarios. Rather than scripting the tests in Bash, I can run more comprehensive tests and collect more data to analyze using fio.

Baseline Test

 fio --filename=/dev/new_storage_lun --direct=1 --rw=read --bs=128k --size=10G --numjobs=1 --iodepth=32 --runtime=300 --time_based --name=new_storage_seq_read
 fio --filename=/dev/old_storage_lun --direct=1 --rw=read --bs=128k --size=10G --numjobs=1 --iodepth=32 --runtime=300 --time_based --name=old_storage_seq_read

Running Mixed Workload Tests

 fio --filename=/dev/new_storage_lun --direct=1 --rw=randrw --rwmixread=70 --bs=4k --size=10G --numjobs=4 --iodepth=16 --runtime=300 --time_based --name=new_storage_mixed

Increased Concurrency and Scale

Increase numjobs and iodepth in subsequent runs to measure how performance changes:

 fio --filename=/dev/new_storage_lun --direct=1 --rw=read --bs=128k --size=10G --numjobs=8 --iodepth=64 --runtime=300 --time_based --name=new_storage_high_concurrency

Run the same tests on the old storage system and record all metrics.

Stress/Soak Tests

12-hour continuous I/O test on both storage systems.

 fio --filename=/dev/new_storage_lun --direct=1 --rw=randwrite --bs=4k --size=100G --numjobs=1 --iodepth=32 --runtime=43200 --time_based --name=new_storage_soak

AWK Line Parsing

I would then pipe the output of these commands to AWK to separate out specific data points and append them to files for full analysis.
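
As an alternative to AWK (not the approach described above), fio can emit machine-readable results with --output-format=json, which a short script can then reduce to the data points of interest. The sketch below assumes the JSON layout of fio 3.x (a "jobs" list with "read" and "write" sections, bandwidth in KiB/s, latency under "lat_ns"). The embedded sample is trimmed and its numbers are made up.

```python
import json

# Trimmed, hypothetical sample of `fio --output-format=json` output.
# Real output has many more fields; these numbers are invented.
SAMPLE_FIO_JSON = """
{
  "jobs": [
    {
      "jobname": "new_storage_seq_read",
      "read":  {"iops": 4000.0, "bw": 512000, "lat_ns": {"mean": 8000000.0}},
      "write": {"iops": 0.0, "bw": 0, "lat_ns": {"mean": 0.0}}
    }
  ]
}
"""


def summarize_fio(json_text):
    """Return one row per job and direction with IOPS, MiB/s, and mean latency."""
    rows = []
    for job in json.loads(json_text)["jobs"]:
        for direction in ("read", "write"):
            stats = job[direction]
            rows.append({
                "job": job["jobname"],
                "direction": direction,
                "iops": stats["iops"],
                "bw_mib_s": stats["bw"] / 1024,          # fio reports KiB/s
                "lat_ms": stats["lat_ns"]["mean"] / 1e6,  # ns to ms
            })
    return rows


for row in summarize_fio(SAMPLE_FIO_JSON):
    print(row)
```

Each row could then be appended to a CSV file per storage system, giving the same kind of files the AWK approach would produce.
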

How are you assuring these tests are objective?

By gathering multiple datasets with varying run parameters, I can reduce statistical noise and better isolate data of interest by comparing these datasets against one another.
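
One way to see how noisy the repeated runs are is to compare the mean and spread of each set. This is a minimal sketch using the standard statistics module; the IOPS figures are hypothetical placeholders, not measurements.

```python
import statistics


def summarize_runs(iops_samples):
    """Mean, sample standard deviation, and coefficient of variation."""
    mean = statistics.mean(iops_samples)
    stdev = statistics.stdev(iops_samples)
    return {"mean": mean, "stdev": stdev, "cv_percent": 100.0 * stdev / mean}


# Hypothetical IOPS from three repeated runs per storage system
OLD_RUNS = [20000, 21000, 19000]
NEW_RUNS = [52000, 51000, 53000]

for label, runs in (("old", OLD_RUNS), ("new", NEW_RUNS)):
    s = summarize_runs(runs)
    print(f"{label}: mean={s['mean']:.0f} stdev={s['stdev']:.0f} cv={s['cv_percent']:.1f}%")
```

If the difference between the two means is small compared to the run-to-run spread, more runs are needed before claiming one system is faster.
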

What is meant by the term Ceteris Paribus, in this context?

Ceteris paribus (“all other things being equal”), in the context of system benchmarking, means that when measuring the performance of one specific aspect of the system, all other variables and conditions are kept constant. This approach ensures that any observed changes in performance can be attributed directly to the variable under test, rather than being influenced by unrelated fluctuations in the environment or system load.
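
A simple way to enforce this in practice is to generate both fio commands from one shared argument list, so the target device is the only thing that varies. The sketch below reuses the baseline parameters and device paths from the test procedure; the helper names are my own.

```python
# Shared parameters from the baseline test: identical for both systems.
COMMON_ARGS = [
    "--direct=1", "--rw=read", "--bs=128k", "--size=10G",
    "--numjobs=1", "--iodepth=32", "--runtime=300", "--time_based",
]


def fio_command(device, name):
    """Build the fio argument list for one storage target."""
    return ["fio", f"--filename={device}", *COMMON_ARGS, f"--name={name}"]


def differing_args(cmd_a, cmd_b):
    """List the argument pairs that differ between two commands."""
    return [(a, b) for a, b in zip(cmd_a, cmd_b) if a != b]


new_cmd = fio_command("/dev/new_storage_lun", "new_storage_seq_read")
old_cmd = fio_command("/dev/old_storage_lun", "old_storage_seq_read")

print(" ".join(new_cmd))
print(" ".join(old_cmd))
print("Differences:", differing_args(new_cmd, old_cmd))
```

The same idea extends beyond the command line: same host, same time of day, and same background load for both runs.
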

Definitions & Terminology

Benchmark
High watermark
Scope
Methodology
Testing
Control
Experiment
Analytics
Descriptive
Diagnostic
Predictive
Prescriptive

Digging Deeper (optional)

Analyzing data may open up a new field of interest to you. Go through some of the free lessons on Kaggle, here: https://www.kaggle.com/learn

a. What did you learn?

b. How will you apply these lessons to data and monitoring you have already collected as a system administrator?

Find a blog or article that discusses the 4 types of data analytics.

a. What did you learn about past operations?

b. What did you learn about predictive operations?

Download Spyder IDE (Open source)

a. Find a blog post or otherwise try to evaluate some data.

b. Perform some linear regression. My block of code follows (it requires some additional libraries, matplotlib and scikit-learn, to be installed. I can help with that if you need it.)

import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Training data: one feature (size) per sample, one target (price) per sample
size = [[5.0], [5.5], [5.9], [6.3], [6.9], [7.5]]
price = [[165], [200], [223], [250], [278], [315]]

plt.title('Pizza Price plotted against the size')
plt.xlabel('Pizza Size in inches')
plt.ylabel('Pizza Price in cents')
plt.plot(size, price, 'k.')
plt.axis([5.0, 9.0, 99, 355])
plt.grid(True)

# Fit the model to the data
model = LinearRegression()
model.fit(X=size, y=price)

# Plot the regression line
plt.plot(size, model.predict(size), color='r')
plt.show()

Reflection Questions

What questions do you still have about this week?
How can you apply this now in your current role in IT? If you’re not in IT, how can you look to put something like this into your resume or portfolio?

Digging Deeper

1. Read the rest of the chapter https://sre.google/workbook/monitoring/ and note anything else of interest when it comes to monitoring and dashboarding.

2. Look up the “ProLUG Prometheus Certified Associate Prep 2024” in Resources -> Presentations in our ProLUG Discord. Study that for a deep dive into Prometheus.

3. Complete the project section of “Monitoring Deep Dive Project Guide” from the prolug-projects section of the Discord. We have a Youtube video on that project as well. https://www.youtube.com/watch?v=54VgGHr99Qg

Labs

https://killercoda.com/het-tanis/course/Linux-Labs/102-monitoring-linux-logs

https://killercoda.com/het-tanis/course/Linux-Labs/103-monitoring-linux-telemetry

https://killercoda.com/het-tanis/course/Linux-Labs/104-monitoring-linux-Influx-Grafana

While completing each lab think about the following:

a. How does it tie into the diagram below?

b. What could you improve, or what would you change based on your previous administration experience.

[Diagram: Red = Inputs, Blue = Outputs]

Install Grafana on the Rocky Linux system by adding the Grafana repo manually.

Create a new repository configuration

sudo vim /etc/yum.repos.d/grafana.repo

Paste:

[grafana]
name=grafana
baseurl=https://packages.grafana.com/oss/rpm
repo_gpgcheck=1
enabled=1
gpgcheck=1
gpgkey=https://packages.grafana.com/gpg.key
sslverify=1

Verify using DNF

sudo dnf repolist

sudo dnf clean all - clears cached metadata so the new repo definition is read fresh

Should see:

repo id      repo name
appstream    Rocky Linux 8 - AppStream
baseos       Rocky Linux 8 - BaseOS
extras       Rocky Linux 8 - Extras
grafana      grafana/spl

Check the grafana package on the official repository

sudo dnf info grafana

Should see something similar 👀

Importing GPG key 0x24098CB6:
 Userid     : "Grafana "
 Fingerprint: 4E40 DDF6 D76E 284A 4A67 80E4 8C8C 34C5 2409 8CB6
 From       : https://packages.grafana.com/gpg.key
Is this ok [y/N]: y

Should see 👀

Name         : grafana
Version      : 8.2.5
Release      : 1
Architecture : x86_64
Size         : 64 M
Source       : grafana-8.2.5-1.src.rpm
Repository   : grafana
Summary      : Grafana
URL          : https://grafana.com
License      : "Apache 2.0"
Description  : Grafana

Install Grafana

sudo dnf install grafana -y

⏳ Takes a while…

Enable and start the systemd unit

sudo systemctl enable --now grafana-server

Verify

sudo systemctl status grafana-server

5.5. Firewall (Security)

The firewall is managed by files present in /etc/firewalld

cd /etc/firewalld/
ls -l

sudo cp /usr/lib/firewalld/services/ssh.xml /etc/firewalld/services/example.xml

sudo firewall-cmd --add-service=grafana --permanent

sudo firewall-cmd --add-port=3000/tcp --permanent
sudo firewall-cmd --reload
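
Once Grafana is running, a quick way to confirm that the firewall rule took effect is to attempt a TCP connection from another machine. The sketch below is a generic reachability check using the standard socket module; the host name in the usage note is the example domain from this guide and the function name is my own.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors
        return False
```

For example, port_is_open("grafana.example.io", 3000) run from a workstation should return True after the reload. A False result could mean the firewall, the service, or name resolution, so it narrows the problem rather than pinpointing it.
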

Edit the config file

sudo vim /etc/grafana/grafana.ini

Change the default values of:

The option ‘http_addr’ to ‘localhost’, the ‘http_port’ to ‘3000’, and the ‘domain’ option to your domain name as below. For this example, the domain name is ‘grafana.example.io’.

For a non-standard port, be sure to uncomment the line (remove the leading ‘;’) 👍

[server]
http_port = 4000

# The public facing domain name used to access grafana from a browser
domain = grafana.example.io

7.1 Turn off the nasty default report of analytics 👺

[analytics]
reporting_enabled = false

7.2. Restart the grafana service to apply a new configuration.

sudo systemctl restart grafana-server
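
To double-check the edits before or after the restart, the settings can be read back with Python's configparser, which treats lines starting with ';' or '#' as comments just like grafana.ini does. This is a sketch under a few assumptions: the fallback values are Grafana's documented defaults, the dummy "[__global__]" header is my own workaround for any keys that appear before the first section, and the sample text stands in for the real file.

```python
import configparser


def read_grafana_settings(ini_text):
    """Extract the settings this guide changes from grafana.ini text."""
    parser = configparser.ConfigParser(interpolation=None)
    # Dummy header so keys placed before the first section do not raise
    parser.read_string("[__global__]\n" + ini_text)
    return {
        "http_port": parser.get("server", "http_port", fallback="3000"),
        "domain": parser.get("server", "domain", fallback="localhost"),
        "reporting_enabled": parser.getboolean(
            "analytics", "reporting_enabled", fallback=True
        ),
    }


# Stand-in for the contents of /etc/grafana/grafana.ini
SAMPLE_INI = """
[server]
;http_addr =
http_port = 3000
domain = grafana.example.io

[analytics]
reporting_enabled = false
"""

print(read_grafana_settings(SAMPLE_INI))
```

A setting that is still commented out shows up as its fallback value, which makes a forgotten ';' easy to spot.
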

Reverse Proxy Setup

Install NGINX

sudo dnf install nginx -y

Create a new server block for grafana

/etc/nginx/conf.d/grafana.conf

# Required to proxy Grafana Live WebSocket connections
map $http_upgrade $connection_upgrade {
  default upgrade;
  '' close;
}

server {
  listen 80;
  server_name grafana.example.io;
  rewrite ^ https://$server_name$request_uri? permanent;
}

server {
  listen 443 ssl http2;
  server_name grafana.example.io;
  root /usr/share/nginx/html;
  index index.html index.htm;

  ssl_certificate /etc/letsencrypt/live/grafana.example.io/fullchain.pem;
  ssl_certificate_key /etc/letsencrypt/live/grafana.example.io/privkey.pem;

  access_log /var/log/nginx/grafana-access.log;
  error_log /var/log/nginx/grafana-error.log;

  location / {
    proxy_pass http://localhost:3000/;
  }

  # Proxy Grafana Live WebSocket connections
  location /api/live {
    rewrite ^/(.*) /$1 break;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection $connection_upgrade;
    proxy_set_header Host $http_host;
    proxy_pass http://localhost:3000/;
  }
}

Next, verify the Nginx configuration

sudo nginx -t

Should see 👀

nginx: the configuration file /etc/nginx/nginx.conf syntax is ok
nginx: configuration file /etc/nginx/nginx.conf test is successful 👍

Start and enable the Nginx service

sudo systemctl enable --now nginx
sudo systemctl status nginx

Install Prometheus (Saturday)

Rocky Prometheus Install

Add New User and Directory ‘prometheus’

Create a new system user ‘prometheus’ for the Prometheus installation.

sudo adduser -M -r -s /sbin/nologin prometheus

Create a new configuration directory ‘/etc/prometheus’ and the data directory ‘/var/lib/prometheus’

(Only needed for running as service)

sudo mkdir /etc/prometheus
sudo mkdir /var/lib/prometheus

Note: All Prometheus configuration lives in the ‘/etc/prometheus’ directory, and all Prometheus data will be saved to the directory ‘/var/lib/prometheus’.

Installing Prometheus on Rocky Linux

Install Prometheus monitoring system manually from the tarball or tar.gz file.

Change the working directory to ‘/usr/src’ and download the Prometheus binary

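cd /usr/src
wget https://github.com/prometheus/prometheus/releases/download/v3.0.1/prometheus-3.0.1.linux-amd64.tar.gz

Extract

tar -xzf prometheus-3.0.1.linux-amd64.tar.gz

cd into the extracted folder and save its path as PROM_SRC, which the later steps use

cd prometheus-3.0.1.linux-amd64
export PROM_SRC=$PWD

Run the binary to confirm it works:

./prometheus --version 👍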

If bin works, proceed

Copy all Prometheus configurations to the directory ‘/etc/prometheus’ and the binary file ‘prometheus’ to the ‘/usr/local/bin’ directory.

Move the Prometheus configuration ‘prometheus.yml’ to the directory ‘/etc/prometheus’.

sudo mv $PROM_SRC/prometheus.yml /etc/prometheus/

Move the binary files ‘prometheus’ and ‘promtool’ to the directory ‘/usr/local/bin/’.

sudo mv $PROM_SRC/prometheus /usr/local/bin/
sudo mv $PROM_SRC/promtool /usr/local/bin/

Move the Prometheus console templates and libraries to the ‘/etc/prometheus’ directory, if the extracted release includes them.

sudo mv $PROM_SRC/consoles /etc/prometheus
sudo mv $PROM_SRC/console_libraries /etc/prometheus

Edit Prometheus configuration ‘/etc/prometheus/prometheus.yml’

sudo vim /etc/prometheus/prometheus.yml

On the ‘scrape_configs’ option, you may need to add monitoring jobs

The default configuration comes with the default monitoring job name ‘prometheus’ and the target server ‘localhost’ through the ‘static_configs’ option.

Change the target from ‘localhost:9090’ to the server IP address ‘192.168.1.10:9090’ as below.

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label job=<job_name> to any timeseries scraped from this config.
  - job_name: "prometheus"

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
      - targets: ["192.168.1.10:9090"]

Change the owner of the configuration and data directories, and everything inside them, to the user ‘prometheus’.

sudo chown -R prometheus:prometheus /etc/prometheus
sudo chown -R prometheus:prometheus /var/lib/prometheus

Basic prometheus installation finished, Hopefully 👍 .

Configure Prometheus

Create a new systemd service

sudo vim /etc/systemd/system/prometheus.service

Copy and paste the following configuration.

[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/prometheus \
    --config.file /etc/prometheus/prometheus.yml \
    --storage.tsdb.path /var/lib/prometheus/ \
    --web.console.templates=/etc/prometheus/consoles \
    --web.console.libraries=/etc/prometheus/console_libraries

[Install]
WantedBy=multi-user.target

Reload the systemd manager to apply a new config.

sudo systemctl daemon-reload

Start and enable the Prometheus service

sudo systemctl enable --now prometheus
sudo systemctl status prometheus

The Prometheus monitoring tool is now accessible on TCP port ‘9090’.

Visit IP address with port ‘9090’

http://192.168.1.10:9090/

And you will see the prometheus dashboard query below.

Prometheus query dashboard
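
The same queries you type into the dashboard can also be sent to the Prometheus HTTP API at /api/v1/query, which is handy for scripted baselines. Below is a minimal sketch using only the standard library; the helper names are my own, and the base URL is the example address used in this guide.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def build_query_url(base_url, promql):
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def extract_samples(payload):
    """Map each series' 'instance' label to its current value."""
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    return {
        series["metric"].get("instance", ""): float(series["value"][1])
        for series in payload["data"]["result"]
    }


def query(base_url, promql, timeout=5):
    """Run an instant query against a live Prometheus server."""
    with urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return extract_samples(json.load(resp))
```

For example, query("http://192.168.1.10:9090", "up") should return a value of 1.0 for each target Prometheus can scrape successfully.
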


ProLUG Links ⛓️

Discord: https://discord.com/invite/m6VPPD9usw
Youtube: https://www.youtube.com/@het_tanis8213
Twitch: https://www.twitch.tv/het_tanis
ProLUG Book: https://leanpub.com/theprolugbigbookoflabs
KillerCoda: https://killercoda.com/het-tanis