
My University Runs Their LMS on AWS. It Still Crashes Under a Few Hundred Students. (Part 1: The Autopsy)

---


Disclaimer: Everything documented here was performed using entirely passive, non-invasive techniques on a publicly accessible web service. Checking page source, reading response headers, running DNS lookups, timing HTTP requests — these are operations any browser performs automatically on every page load. No credentials were accessed. No systems were modified. No vulnerabilities were exploited. The server produced all observed behavior in response to standard HTTP requests.

This analysis was conducted for educational purposes and published in the interest of understanding how web infrastructure fails under load — a topic directly relevant to anyone building or maintaining backend systems.

If this post accurately describes your infrastructure and that makes you uncomfortable, the correct response is to fix the infrastructure. Not to contact me.


Every exam season at REVA, without fail, the LMS dies.

Not "slows down a bit." Dies. The entire exam cohort — anywhere from a few hundred to over a thousand students depending on the course — hits login before an MCQ exam and within ten minutes the portal is serving blank pages, timeouts, and the spinning wheel of academic despair. Then someone restarts it. Then it crashes again. Then a professor sends a WhatsApp message to the class group: "LMS is down, wait." Then the exam gets delayed. Then we all pretend this is normal.

The thing that finally cracked my patience wasn't the crash itself. It was sitting there for twenty-plus minutes waiting to submit a one-credit MCQ exam that I was going to copy anyway, because the portal's exam security is so absent it's almost impressive. Tab switches go completely undetected. Focus loss events go undetected. The entire anti-cheat layer is nonexistent — and this is not a hard problem. The Page Visibility API ships in every browser. A document.addEventListener('visibilitychange', ...) handler is five lines of JavaScript. That's not a knock at whoever built this specifically; that's just context for the overall level of attention the portal has received over the years.

So that scratched something in my brain. If the exam security is this unattended, what does the backend actually look like? I had a Linux machine, a terminal, and a target URL. Let's find out.

Everything I ran here is passive recon. No exploit modules, no brute force, nothing that modifies the server. All of this is publicly visible to anyone with a browser and basic tooling.

Step 1: Confirming What It Is

The URL told me most of what I needed: rulms.reva.edu.in/login/index.php. That exact path — /login/index.php — is Moodle's default login route, shipped that way out of the box.

Page source confirmed it in the third line: <meta name="generator" content="Moodle">. The session cookie is named MoodleSession — Moodle's hardcoded name. Response headers include x-redirect-by: Moodle. The theme is Bloom, a paid third-party Moodle skin. Font Awesome 6.5.1 in the CSS assets dates the install to late 2023 at the earliest, putting it at Moodle 4.3 or 4.4.
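A minimal sketch of those fingerprint checks as one shell function. The function name count_moodle_markers and the sample input are my own illustration, not output captured from the portal; the three markers are the ones listed above.

```shell
# Count Moodle fingerprint markers in HTTP headers + HTML read from stdin.
# Prints how many of the three markers were found (0 to 3).
count_moodle_markers() {
  local input hits=0
  input=$(cat)
  # 1. Generator meta tag in the page source
  printf '%s\n' "$input" | grep -qi 'name="generator" content="Moodle' && hits=$((hits + 1))
  # 2. Session cookie named MoodleSession
  printf '%s\n' "$input" | grep -qi 'set-cookie: *MoodleSession' && hits=$((hits + 1))
  # 3. x-redirect-by: Moodle response header
  printf '%s\n' "$input" | grep -qi '^x-redirect-by: *Moodle' && hits=$((hits + 1))
  echo "$hits"
}

# Illustrative input only (hypothetical cookie value).
printf '%s\n' \
  'x-redirect-by: Moodle' \
  'set-cookie: MoodleSession=abc123; path=/' \
  '<meta name="generator" content="Moodle">' | count_moodle_markers
```

Against a live site, the same function can be fed from curl -s -i <url>, which prints headers followed by the body.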

It's stock Moodle. Not custom software, not an in-house build. The open-source PHP LMS that universities have been installing and misconfiguring since 2002.

Confidence: 99%.

Step 2: Where Does It Live

I assumed a campus server room. Old hardware, one underpaid IT person managing it, the usual. I ran a DNS lookup:

bash
nslookup rulms.reva.edu.in

The response came back as a CNAME pointing to an [redacted].elb.ap-south-1.amazonaws.com hostname. elb is AWS Elastic Load Balancer. ap-south-1 is the AWS Mumbai region. The hostname resolved to three distinct IP addresses.

whois on those IPs confirmed: Amazon Data Services India, Mumbai. Reverse DNS on each address resolved to ec2-[...].ap-south-1.compute.amazonaws.com — EC2 instances.

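REVA's LMS is not on a campus server. It's on AWS, in Mumbai, behind a load balancer, with what looks like three backend machines. (Strictly, the addresses a load balancer's hostname resolves to belong to the balancer's own nodes, and any public IP in EC2 address space gets an ec2-... reverse DNS name, so three addresses suggests three backends rather than proving it.) On paper, this is a properly architected system. That's what makes the rest of this interesting.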

Confidence: 100%.

Step 3: What the Load Balancer Is Actually Doing

Before getting into what's broken here, it's worth being precise about what a load balancer is and the specific type in use, because the distinction matters a lot.

A load balancer sits in front of your application servers and distributes incoming traffic across them. The idea is that instead of one server handling all requests and falling over, you spread the load across several. Simple concept. The implementation details are where things diverge.

There are two main types relevant here:

Layer 7 (Application Load Balancer / ALB) — operates at the HTTP level. It can read request headers, inspect cookies, route based on URL paths, and — critically — maintain session affinity by tracking which user maps to which backend server. If user A is on server 1, the ALB keeps sending user A to server 1.

Layer 4 (Network Load Balancer / NLB): operates at the TCP level. It has no awareness of HTTP, cookies, or sessions. It sees raw TCP connections and forwards them. That's it. Each new connection is placed independently of the last (AWS picks the target with a flow hash, which from the outside looks like an even spread): connection 1 goes to server A, connection 2 may go to server B, connection 3 to server C.

REVA is running an NLB.

Here's why that's a problem. When you log into Moodle, the server creates a session record and gives you a MoodleSession cookie as the key. Every subsequent request you make sends that cookie, and the server looks up your session to verify who you are. If sessions are kept in local files rather than a shared store, that session data lives only on whichever backend server handled your login.

If your session lives on EC2-A and your next request gets routed to EC2-B by the NLB, EC2-B has never seen your session. It has no record of you. You get logged out or get an error — mid-login, mid-exam, mid-submission.

I tested this from a Linux machine:

bash
for i in {1..10}; do
  curl -o /dev/null -s -w "%{remote_ip}\n" https://rulms.reva.edu.in/login/index.php
done

Ten sequential requests from the same client. The output showed all three backend IPs turning up in no fixed order, with no sign of affinity. whatweb confirmed it from a different angle: the initial request to the domain was served by one backend server, and the immediate redirect to /login/index.php was served by a different one. Two different servers in a single browser interaction, before the login page has even loaded.
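To make the spread easier to read than ten raw lines, the loop's output can be tallied per address. This is a small sketch; tally_ips is a name I made up, and the sample input uses reserved documentation addresses (203.0.113.x), not the real ones.

```shell
# Tally how often each remote IP appears in a list of addresses (one per line).
# Prints "ip count" pairs, sorted by address.
tally_ips() {
  sort | uniq -c | awk '{ print $2, $1 }'
}

# Placeholder input using RFC 5737 documentation addresses, not real results.
printf '%s\n' 203.0.113.10 203.0.113.20 203.0.113.10 203.0.113.30 | tally_ips
```

Piping the curl loop above into tally_ips gives one line per address with its hit count. Roughly equal counts across the addresses mean the client is not being pinned to one of them.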

No session affinity was observable in any of my testing. Whether sticky sessions were explicitly disabled or simply never configured is not visible from outside — but the behavioral result is the same either way. The NLB is distributing connections across all three backends with no apparent attempt to keep a given client on the same server.

The NLB supports source IP affinity, which would at least route all traffic from a given IP to the same backend. Switching to an ALB would give proper cookie-based session stickiness. Neither appears to be in use.

The round-robin behavior: confirmed. Its impact on Moodle sessions, and on students getting logged out mid-exam, depends on where Moodle keeps session state. If sessions are stored on individual EC2s with no shared session backend, a student whose session was created on EC2-A will get logged out the moment the NLB routes their next request to EC2-B. Whether Moodle has been configured with a shared session store (Redis, Memcached, or database) that would make this a non-issue is one of the few things I genuinely can't determine from outside.

Step 4: The Web Server

curl -I on the login page returned:

code
server: green-banana

Not nginx. Not Apache. green-banana.

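In Nginx, this is a single config line, though not with the built-in add_header, which would send a second Server header alongside the real one. Replacing the value takes something like the headers-more module:

nginx
more_set_headers "Server: green-banana";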

Someone renamed it deliberately — a security-through-obscurity move to prevent automated scanners from fingerprinting the server and targeting Nginx-specific exploits. Not a bad idea in isolation. The 404 error pages gave it away anyway: stock Nginx error page template, just without the version number. They obscured the identity but forgot to customize the error pages.

The more significant find was HTTP/2. I checked:

bash
curl -v --http2 https://rulms.reva.edu.in/login/index.php 2>&1 | grep -i "connection"

HTTP/2 was never negotiated. HTTP/1.1 only. Every response header also confirmed connection: close.

Why this matters: the Moodle login page loads somewhere between 20 and 30 assets (CSS bundles, JavaScript files, web fonts, images). With HTTP/1.1 and connection: close, each of those assets needs its own TCP connection and, because this is HTTPS, its own TLS handshake on top. The full cycle: SYN → SYN-ACK → ACK → TLS handshake → request → response → FIN. From scratch, every time, for every file.

HTTP/2 changes this fundamentally. It uses a single TCP connection per client and multiplexes all asset requests over it simultaneously, a technique called stream multiplexing. The browser sends all 25 requests in parallel over one connection, the server responds to each as they complete, and the connection stays open for the next page. This is not experimental technology. HTTP/2 became a standard in 2015. Nginx has shipped with HTTP/2 support since 2015. It takes one line of config to enable (on Nginx 1.25.1 and newer, the same switch is the standalone directive http2 on;):

nginx
listen 443 ssl http2;

That's it. It's not there.

When a few hundred to over a thousand students load the login page simultaneously, each of them is burning 20–30 TCP handshake cycles just to see the login form. The server is handling potentially tens of thousands of connection open/close cycles before a single credential has been submitted.
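The arithmetic behind "tens of thousands" is a back-of-the-envelope sketch. The function name is mine, and 300 is my stand-in for "a few hundred". It counts one connection per asset per student and ignores retries.

```shell
# Connections needed for one login-page load per student when nothing is reused.
# Args: number of students, assets on the login page.
connections_without_reuse() {
  echo $(( $1 * $2 ))
}

connections_without_reuse 300 20    # low end: 300 students x 20 assets = 6000
connections_without_reuse 1000 30   # high end: 1000 students x 30 assets = 30000
```

Browsers cap parallel connections per host, which spreads these out over time but does not reduce the total. With keep-alive or HTTP/2, the same page load would need a handful of connections per student at most.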

HTTP/1.1 only, connection: close — confirmed: 100%.

Step 5: The Database

Port scan:

bash
nmap -sV -p 80,443,3306,5432,6379,11211,8080,8443 rulms.reva.edu.in

Every port except 80 and 443 returned filtered. This distinction matters technically: closed means the port is reachable but nothing is listening; filtered means a firewall is actively dropping packets — the service may exist behind it, or it may not. Port 3306 (MySQL): filtered, so possibly running and firewalled, though that's inference. Port 6379 (Redis): filtered, status unknown. Port 11211 (Memcached): filtered, status unknown. Firewalling these ports is correct practice regardless, so it tells me about security posture more than about what's actually running.

I then tested response times by hitting each backend server directly, bypassing the load balancer:

bash
curl --resolve rulms.reva.edu.in:443:[IP-A] ...  # 186ms
curl --resolve rulms.reva.edu.in:443:[IP-B] ...  # 167ms
curl --resolve rulms.reva.edu.in:443:[IP-C] ...  # 208ms

All three nearly identical. This is the tell. If the database were colocated on the same EC2 instance as PHP — sharing CPU cores and RAM between the application server and the database engine — that machine would be measurably slower than the others, especially under any real load. Three nearly identical response times across three separate machines points strongly to the database living on a separate instance, almost certainly Amazon RDS.

Separate DB instance: ~85% confidence. MySQL on RDS: ~75%.

What I can't confirm from outside: whether it's a single RDS instance or has read replicas, whether a connection pooler has been configured, or what max_connections is set to.

Here's why the DB tier is likely the bottleneck even without knowing those specifics. Moodle's default behavior, absent explicit pooler configuration, is for each PHP-FPM worker to open its own database connection and hold it for the entire request lifecycle. If no pooler such as ProxySQL (for MySQL) or PgBouncer (for PostgreSQL) has been placed in front of the database, every active PHP worker holds one live DB connection directly.

To illustrate the math: a mid-range RDS instance (say, a db.t3.medium) has a default max_connections around 312. Three EC2 instances running PHP-FPM at roughly 50 workers each would account for up to 150 simultaneous connections as soon as every worker is busy, nearly half the ceiling from one application tier alone. The actual instance size and worker counts here are unknown; this is an illustrative example of the ceiling problem, not a specific claim about their configuration.
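The same ceiling math as a sketch you can plug other numbers into. The function names are made up for illustration. The figures 312, 3, and 50 are the illustrative ones from the paragraph above, and the 110-worker case is a further hypothetical showing where the ceiling breaks.

```shell
# Worst-case DB connections if every PHP-FPM worker holds one at the same time.
# Args: app instances, workers per instance.
peak_db_connections() {
  echo $(( $1 * $2 ))
}

# Connection slots left under a max_connections ceiling (negative = exceeded).
# Args: max_connections, app instances, workers per instance.
db_headroom() {
  echo $(( $1 - $(peak_db_connections "$2" "$3") ))
}

peak_db_connections 3 50   # 3 x 50 = 150
db_headroom 312 3 50       # 312 - 150 = 162 slots left
db_headroom 312 3 110      # 312 - 330 = -18, ceiling exceeded
```

A negative result means some workers would be refused or queued at the database even if every query were fast.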

Each Moodle login generates roughly 15–20 database queries: credential lookup, session record creation, session data write, dashboard initialization, course enrollment queries, recent activity fetch, notification count. During a login rush, each PHP worker holds its DB connection open for the full duration of those sequential queries. If the connection ceiling is hit, new workers queue waiting for a slot. The queue backs up. Response times balloon. Timeouts start.

Single RDS instance, no external connection pooler, no read replicas: ~70% confidence — based on crash behavior patterns consistent with this setup, not direct observation.

Step 6: The 5-Second Moment

I ran a baseline response time test at around 3am — not for atmosphere, but because at 3am there are genuinely no other users on REVA's LMS unless there's a batch of insomniacs submitting late assignments. Zero concurrent load, essentially:

bash
for i in {1..20}; do
  curl -o /dev/null -s -w "%{time_total}\n" https://rulms.reva.edu.in/login/index.php
done

Most responses: 170–250ms. Acceptable for a quiet server.

Then request nine:

code
5.199260

5.19 seconds. One request. 3am. Zero other users on the platform. Immediately back to normal on the next request.

Unless there's a night shift of a hundred other unemployed REVA students hitting the LMS at 3 in the morning, that spike has nothing to do with concurrent load. It's a PHP worker that got stuck waiting — on a database connection, on a file lock, on something in the internal request path that intermittently blocks. Under zero load, the next worker picks it up in a few seconds and the queue clears. Under a few hundred concurrent users all logging in simultaneously, that same blocking condition becomes the steady state. Every worker is stuck. Nothing processes. The portal dies.
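Spotting a spike like that by eye works for twenty requests but not for a longer run. Here is a small sketch that flags outliers from the timing loop's output. flag_slow and the 1-second threshold are my own choices, and in the sample input only 5.199260 is a measured value; the others are placeholders inside the observed 170–250ms range.

```shell
# Print "request_number seconds" for every timing above a threshold.
# Reads one time_total value per line on stdin; arg 1 is the threshold in seconds.
flag_slow() {
  awk -v limit="$1" '($1 + 0) > (limit + 0) { print NR, $1 }'
}

# Illustrative input: only 5.199260 comes from the actual run.
printf '%s\n' 0.181 0.204 5.199260 0.176 | flag_slow 1
```

Piping the curl timing loop into flag_slow 1 prints only the requests that took longer than one second, with their position in the run.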

The 170–250ms baseline suggests some caching is active — probably Moodle's built-in file-based cache, which stores compiled templates and static PHP output to disk. File-based cache helps with rendered HTML. It does nothing for session reads and writes, which hit the database by default on every single page request — both reading the current session to authenticate you and writing updated session data at the end of each request.

Redis would replace the database for session operations entirely. A Redis session lookup is roughly 1ms. A database session lookup under contention is 10–50ms on a good day, and effectively infinite when the DB connection pool is exhausted. The difference is the crash.

No Redis configured for sessions: ~75% confidence.

Step 7: File Storage

Assignment download URL from the portal:

code
/pluginfile.php/[id]/assignsubmission_file/submission_files/[id]/ASSIGNMENT.PDF

pluginfile.php is Moodle's file-serving script. It authenticates the request, reads the file off local disk, and streams it byte-by-byte through PHP as the HTTP response. The PHP worker is occupied for the entire duration of that file transfer — it can't serve anything else until the download completes.

A proper cloud file storage setup looks completely different. You store files in S3, and when a student requests a download, PHP generates a pre-signed S3 URL (a time-limited, cryptographically signed URL that grants temporary access to a specific object) and returns a redirect. The actual file transfer happens directly between the student's browser and S3. PHP never touches the bytes. Workers stay free.

The more critical problem is where these files actually live. With no sticky sessions and three EC2 instances rotating requests round-robin, a file uploaded on EC2-A lives on EC2-A's disk. EC2-B and EC2-C have no copy of it. If a student uploads on one server and their next request is routed to a different one, that file doesn't exist from the perspective of the handling server.

Two possible explanations: either they've mounted an AWS EFS (Elastic File System) volume — essentially a network-attached disk — shared across all three instances, which would mean every file read/write crosses a network hop and adds latency. Or files are genuinely split across instances and some percentage of file retrievals silently fail depending on which server handles the request. I can't confirm which from outside.

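File serving through PHP via pluginfile.php: confirmed 100%. Whether the disk behind it is local or a shared mount can't be determined from outside.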

Step 8: The Part That Actually Got Me

I ran sslyze against the domain for TLS configuration analysis.

TLS 1.2 and 1.3 only — SSL 2.0, 3.0, TLS 1.0, and 1.1 all rejected. Cipher suites: ECDHE key exchange with AES-256-GCM and ChaCha20-Poly1305. Forward secrecy on every connection — meaning a compromised private key can't decrypt past sessions. No Heartbleed. No ROBOT attack vulnerability. No CCS injection. Fully compliant with Mozilla's recommended intermediate TLS profile.

Then the headers: PHP version hidden, no X-Powered-By. Nginx version suppressed. Server renamed to green-banana. Every common admin panel path — /phpmyadmin/, /pma/, /db/, /mysql/, /admin/phpmyadmin/ — all 404. No phpinfo() files left in the web root. README, CHANGELOG, and upgrade.txt all deleted. Let's Encrypt cert, most recently renewed April 16, almost certainly via a certbot cron job.

Someone who genuinely understands security built this. The TLS hardening is not accidental — it takes deliberate configuration to reject legacy protocols, choose strong cipher suites, and enable forward secrecy. The header scrubbing is intentional. The directory cleanup is intentional. This is the work of someone with real knowledge.

And then:

  • Sticky sessions on the NLB: a checkbox left unchecked
  • HTTP/2: one config line not added
  • Keep-alive: one config line not added
  • HSTS: present but set to max-age=0, which actively instructs browsers to remove any stored HSTS policy for this domain rather than enforce one — worse than not having the header at all
  • Files on local disk behind a multi-server setup
  • No WAF, no rate limiting — every request hits PHP directly, no throttling on concurrent connections per IP
  • A 5-second latency spike under zero load at 3am that apparently no monitoring system ever flagged
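The HSTS item in the list above is easy to check mechanically. This is a minimal sketch, with hsts_status as a hypothetical helper name. It only reads the max-age directive and ignores includeSubDomains and preload.

```shell
# Classify a Strict-Transport-Security header line by its max-age value.
# Arg 1: the raw header line.
hsts_status() {
  local age
  age=$(printf '%s\n' "$1" | grep -io 'max-age=[0-9]*' | head -n 1 | cut -d= -f2)
  if [ -z "$age" ]; then
    echo "no max-age found"
  elif [ "$age" -eq 0 ]; then
    echo "policy cleared"          # max-age=0 tells browsers to forget the host
  else
    echo "enforced for ${age}s"
  fi
}

hsts_status 'strict-transport-security: max-age=0'
hsts_status 'strict-transport-security: max-age=31536000; includeSubDomains'
```

The second call uses one year (31536000 seconds), a commonly used value, as a contrast case. It is not something the server returned.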

The security configuration is meticulous. The performance configuration is Moodle install defaults, never revisited.

They built an actual lock on the front door and left the kitchen on fire.

The Full Picture

Best reconstruction of the actual running architecture, based on confirmed evidence and educated guesses where noted:

code
┌──────────────────────────────────────────────────────┐
│          Students (hundreds to 1000+ concurrent)     │
│                  during exam windows                 │
└──────────────────────┬───────────────────────────────┘
                       │ HTTPS / TCP
                       ▼
┌──────────────────────────────────────────────────────┐
│            AWS Network Load Balancer (NLB)           │
│              Mumbai  ·  ap-south-1                   │
│    Layer 4 TCP  ·  Round-robin  ·  No sticky         │
│    sessions  ·  No WAF  ·  No rate limiting          │
└───────────┬──────────────┬──────────────┬────────────┘
            │              │              │
            ▼              ▼              ▼
    ┌─────────────┐ ┌─────────────┐ ┌─────────────┐
    │   EC2 - A   │ │   EC2 - B   │ │   EC2 - C   │
    │─────────────│ │─────────────│ │─────────────│
    │ Nginx       │ │ Nginx       │ │ Nginx       │
    │ "green-     │ │ "green-     │ │ "green-     │
    │  banana"    │ │  banana"    │ │  banana"    │
    │ HTTP/1.1    │ │ HTTP/1.1    │ │ HTTP/1.1    │
    │ conn: close │ │ conn: close │ │ conn: close │
    │─────────────│ │─────────────│ │─────────────│
    │ PHP-FPM     │ │ PHP-FPM     │ │ PHP-FPM     │
    │ Moodle 4.x  │ │ Moodle 4.x  │ │ Moodle 4.x  │
    │ Bloom theme │ │ Bloom theme │ │ Bloom theme │
    │─────────────│ │─────────────│ │─────────────│
    │ Local disk  │ │ Local disk  │ │ Local disk  │
    │ (files NOT  │ │ (files NOT  │ │ (files NOT  │
    │  shared)    │ │  shared)    │ │  shared)    │
    │─────────────│ │─────────────│ │─────────────│
    │  Sessions:  │ │  Sessions:  │ │  Sessions:  │
    │  DB/file    │ │  DB/file    │ │  DB/file    │
    │  (no Redis) │ │  (no Redis) │ │  (no Redis) │
    └──────┬──────┘ └──────┬──────┘ └──────┬──────┘
           └───────────────┼───────────────┘
                           │ MySQL  ·  port 3306
                           │ (firewalled, not exposed)
                           ▼
    ┌──────────────────────────────────────────────┐
    │          MySQL RDS — single instance         │
    │   No read replicas  ·  No connection pooler  │
    │   Default max_connections  ·  All writes,    │
    │   reads, and session ops hitting same DB     │
    └──────────────────────────────────────────────┘

| Component | Finding | Confidence |
|---|---|---|
| Moodle 4.3/4.4, Bloom theme | Confirmed | 99% |
| AWS EC2, Mumbai (ap-south-1) | Confirmed | 100% |
| 3 EC2 instances behind NLB | Confirmed | 100% |
| NLB: no sticky sessions | Confirmed | 100% |
| Nginx, HTTP/1.1, connection: close | Confirmed | 100% |
| No HTTP/2 | Confirmed | 100% |
| Files on local disk (pluginfile.php) | Confirmed | 100% |
| No WAF, no rate limiting | Confirmed | 100% |
| Separate DB instance (not colocated) | Educated guess | 85% |
| MySQL on RDS, single instance | Educated guess | 75% |
| No Redis for session storage | Educated guess | 75% |
| No connection pooler (ProxySQL/PgBouncer) | Educated guess | 70% |
| Overall architecture picture | | ~85% |

Why It Crashes

The full cascade, reconstructed:

  1. Exam window opens. The cohort — hundreds to over a thousand students — hits login within minutes.
  2. NLB distributes connections across three EC2s with no session affinity. Students start getting bounced between servers mid-login. Unless sessions live in a shared store, MoodleSession cookies issued by EC2-A mean nothing to EC2-B or EC2-C.
  3. Each login generates ~15–20 sequential DB queries: credential validation, session record write, session data serialization, dashboard query, course enrollment fetch, notification count, recent activity load.
  4. No connection pooler means each PHP-FPM worker opens and holds a live DB connection for the entire duration of those queries. A worker stays occupied until the DB has answered every one of them.
  5. connection: close means Nginx is simultaneously handling thousands of brand-new TCP connections — one per asset per student — instead of reusing established ones.
  6. PHP-FPM worker pools on all three EC2s fill up. Nginx starts queuing requests.
  7. RDS max_connections ceiling approaches. MySQL begins queuing incoming connections internally.
  8. Now PHP workers are waiting on DB connections that are waiting on MySQL's internal queue. Not a deadlock in the strict sense, but the request pipeline stalls end to end.
  9. Memory pressure builds on the EC2s from queued requests holding open sockets and buffers. The Linux OOM killer may start terminating PHP-FPM workers, taking their open DB connections with them.
  10. Portal stops responding. Every request times out.
  11. Someone SSH-es in. Restarts PHP-FPM or the whole EC2. DB connection pool clears. Memory frees up.
  12. Students rush back in. Step 1 repeats.

A few hundred concurrent users. AWS infrastructure. A load balancer. Three application servers. Paying Mumbai EC2 rates monthly. Crashes on exam day, every semester, on schedule.

The load is not the problem. A few hundred concurrent logins is a solved problem from roughly 2010. WordPress.com handles tens of millions of requests daily on comparable infrastructure. The problem is that the performance configuration has never been touched since the day Moodle was installed.

Conclusion

I want to be precise here, because it's easy to read this as "REVA's IT department is incompetent" and that's not what the evidence shows.

The security configuration on this server is genuinely good. TLS hardening, cipher suite selection, header scrubbing, directory cleanup, certificate management — someone made real, informed decisions on each of these. That's not accidental. That takes knowledge and deliberate effort. Whoever set this up clearly knew what they were doing on the security side.

The performance layer looks like a different story entirely — like it was configured once at installation and never revisited. No sticky sessions, no HTTP/2, no keep-alive, no Redis for sessions, no connection pooler, files on local disk behind a load balancer. None of these are exotic optimizations. They're standard practice for any production web application serving more than a handful of concurrent users. They're also all configuration changes, not infrastructure rebuilds. Most of it is an afternoon of work, some of it is closer to an hour.

My guess is one person set this whole thing up — probably under time pressure, probably without a dedicated DevOps budget — got the security right because they knew to care about it, and got the performance wrong because nobody told them the exam load patterns until after it was live. Then it became someone else's problem, and whoever that someone else is has other things to do.

That's a resource and prioritization problem, not an incompetence problem. The IT team got half the job done correctly under conditions that probably weren't ideal. The half they didn't get done is costing hundreds of students their exam time every semester.

For a system that handles academic assessments — things that affect grades, progression, and in some cases whether students can graduate on time — "crashes under normal exam load, requires manual restart, loses sessions mid-attempt" is not an acceptable operational state. Especially when the fixes are this straightforward.

Part 2 covers exactly what those fixes look like, what they cost, and what the architecture should actually look like for a system this size. Anyone from the REVA IT department, if you're reading this: for the love of god, please follow up with the next post and fix the backend.
