A Practical Guide to System Design Thinking
Not Your Traditional System Design Blog
I'll be honest - there are tons of system design resources out there. That's not what we're doing today. Instead, I want to share how I think about system design - the mental model I've developed from building startups from scratch and scaling systems at enterprises.
Don't worry, it's not just theory. I've included detailed resources from an actual system design case study that I recently did to show these principles in action.
A Quick Disclaimer
Rather than preaching a fixed method, I want to encourage critical thinking. I'm constantly evolving my approach, and I hope you'll do the same. Take my methods with a grain of salt, understand them, and improve upon them. If you discover something interesting along the way, shoot me an email at [email protected]. Let's keep this fountain of knowledge flowing.
Setting the Stage
Before we dive in, let's get one thing straight: technology decisions should always follow stakeholder needs, not lead them. Yes, you'll drive the technical choices, but they should be rooted in actual problems we're trying to solve.
Let's see this approach in action by designing a real system.
Define the Problem
Let's start with our goal: design a sales forecasting system that helps customers see their sales forecast on weekly, monthly, and yearly bases. The service needs to be highly available, accurate, and consistent.
That's it - that's all we know. I've intentionally kept this vague to show you how to approach system design when requirements aren't crystal clear.
Starting with Why
Before jumping into how to build it, let's ask the most important question: why do our stakeholders need this? We're qualified enough to design this system, so we should be critical enough to validate its value.
Here's a principle I always remember: if you'd asked business owners 100 years ago what they wanted, they'd probably have said "a faster horse." As problem solvers, it's our job to recognize that what they really need is a car.
For our sales forecasting system, here's why it matters:
- Help businesses forecast future sales
- Enable understanding of customer spending patterns
- Support informed inventory decisions
- Guide budget planning based on predicted revenue
Technical Analysis
Let's get specific about what we're building. This isn't theoretical - in a real scenario, these numbers would come from your company's data. For now, I'm making informed assumptions.
We're looking at:
- A time-series ML model resistant to overfitting and vanishing gradients
- ~250 million daily records (I'll share the napkin math shortly)
- 7 years of data retention for compliance
- High availability with strong consistency (yes, we'll tackle the CAP theorem trade-offs)
- 7% YoY growth, leading to terabytes of data
All of this will run on AWS - not because it's the only choice, but because it offers the complete solution we need for this case study.
Requirements Breakdown
Let's break this into clear, actionable requirements. I'm dividing these into functional and non-functional requirements because these will drive our technical decisions.
Functional Requirements
Straightforward stuff - our system needs to:
- Show sales predictions by product category
- Support different time frames (week/month/year)
Non-functional Requirements
Here's where it gets interesting:
- Performance
  - Handle 2,800+ writes/second (normal)
  - Scale to 14,000+ writes/second (peak)
  - Keep prediction response time under 100ms
- High Availability
  - 99.99% uptime
  - Multi-region deployment
  - Automated failover mechanisms
- Consistency
  - Strong consistency for sales data
  - Eventual consistency for ML models
  - Cross-region data sync
- Accuracy & Quality
  - 90%+ prediction accuracy
  - Data quality validation in pipeline
  - Model performance monitoring
- Scalability
  - Support 5M MAU, 500K DAU with 7% YoY growth
  - Handle 365TB before archival
  - Maintain 7 years of cold storage
Special Considerations
Here's a crucial point that's easy to miss: new businesses won't have historical data. Our ML model typically needs 3 years of data to identify patterns effectively. We'll need an alternative strategy for these new users - something we'll address in our design.
The Math Behind It
You might wonder why we're doing math before architecture. Here's why: these calculations drive everything from capacity planning to cost estimates. They help us make informed decisions about infrastructure choices and validate our architecture can handle real-world load.
Let's break down our assumptions and calculations. I've documented my detailed napkin math here. In a real-world scenario, you'd get these numbers from your analytics team, but for our case study, I've researched industry standards to make realistic assumptions.
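To give a feel for how the headline figures fall out, here's a quick back-of-the-envelope sketch. The ~4KB average record size and the 5x peak multiplier are my own assumptions for illustration; the detailed napkin math document has the exact inputs.

```python
# Back-of-the-envelope sketch of the headline numbers.
# Assumptions (mine, not from an analytics team): ~4 KB per sales record, 5x peak multiplier.
DAILY_RECORDS = 250_000_000
RECORD_SIZE_KB = 4
PEAK_MULTIPLIER = 5

writes_per_second = DAILY_RECORDS / 86_400                     # ~2,894 -> the "2,800+" requirement
peak_writes_per_second = writes_per_second * PEAK_MULTIPLIER   # ~14,470 -> the "14,000+" requirement

tb_per_day = DAILY_RECORDS * RECORD_SIZE_KB / 1_000_000_000    # ~1 TB ingested per day
tb_per_year = tb_per_day * 365                                 # ~365 TB before archival
tb_seven_years = tb_per_year * 7                               # ~2.6 PB of cold storage over 7 years

print(f"{writes_per_second:,.0f} writes/s normally, {peak_writes_per_second:,.0f} at peak")
print(f"{tb_per_day:.1f} TB/day, {tb_per_year:.0f} TB/year, {tb_seven_years:.0f} TB over 7 years")
```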
Understanding User Interactions
Before diving into the technical architecture, we need to understand how different users will interact with our system. I've created a use-case diagram that maps out these interactions. This diagram isn't just for documentation - it helps us validate that our technical requirements align with actual user needs and ensures we're not over-engineering or missing critical functionality.
Data Model Design
Before diving into architecture, let's understand our data model. While we'll use DynamoDB for hot storage, let's first look at the logical relationships between our entities: view diagram.
Logical Relationships
- Business to Categories (1:N): Each business can have multiple product categories
- Sales Records: Track transactions with necessary metadata
- ML-specific entities: Track model versions and performance metrics
DynamoDB Implementation
In DynamoDB, we denormalize these relationships for performance and durability. Here's how:
- Sales Records Table
  - Partition Key: business_id
  - Sort Key: timestamp
  - Global Secondary Index on category_id and timestamp for category-specific queries
  - Instead of foreign keys, we replicate essential product data:
    - Product name, category, and price at time of sale
    - Protects against data loss if a product is deleted
    - Preserves historical accuracy of sales records
- Products Table
  - Partition Key: business_id
  - Sort Key: product_id
  - Maintains current product catalog
  - Changes don't affect historical sales data
- Business Metadata Table
  - Partition Key: business_id
  - Contains business details and active categories
- ML Model Tables
  - Separate tables for model versions and metrics
  - Keeps ML operations isolated from transactional data
This denormalized structure provides:
- Sub-millisecond query performance for recent sales
- Data durability (no broken foreign key references)
- Historical accuracy for sales analysis
- Efficient category-based querying for ML features
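To make that last point concrete, here's a minimal boto3 sketch of how the category GSI might be queried. The table name, index name, and the business_id filter are illustrative assumptions, not the actual implementation.

```python
import boto3
from boto3.dynamodb.conditions import Attr, Key

dynamodb = boto3.resource("dynamodb")
sales = dynamodb.Table("SalesRecords")  # assumed table name

def category_sales(business_id: str, category_id: str, start_iso: str, end_iso: str) -> list[dict]:
    """Query one category's sales in a time window via the GSI on (category_id, timestamp)."""
    response = sales.query(
        IndexName="category_id-timestamp-index",  # assumed GSI name
        KeyConditionExpression=Key("category_id").eq(category_id)
        & Key("timestamp").between(start_iso, end_iso),
        # In practice you'd likely fold business_id into the GSI partition key
        # (e.g. business_id#category_id) to avoid this filter; it's kept here
        # only to stay close to the table layout described above.
        FilterExpression=Attr("business_id").eq(business_id),
    )
    # Each item already carries the denormalized product name, category, and price-at-sale,
    # so no second lookup against the Products table is needed.
    return response["Items"]
```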
Understanding these relationships helps us make better decisions about:
- Data integrity in our application layer
- Feature engineering for ML
- Data archival strategy in S3
System Architecture Deep Dive
Let's get into the meat of our design. I've created a high-level architecture diagram that you can find here. Don't get overwhelmed - we'll break it down piece by piece.
Technology Choices
We chose AWS/DynamoDB/SageMaker because:
- Complete solution for our needs
- Strong ML integration
- Global presence for multi-region
- Managed services reduce operational overhead
Alternative stacks considered:
- GCP: Good ML capabilities but more complex setup
- Azure: Strong in ML but less mature in NoSQL offerings
Storage Strategy
Let's start with how we handle data - it's the foundation of our system.
Hot Storage (First 90 Days)
- Using DynamoDB for recent data with strong consistency
- Handling ~2.9K writes/second normally, scaling to ~15K during peaks
- Why DynamoDB? It gives us the strong consistency we need for sales data
- 90-day TTL to optimize costs (DynamoDB is expensive!)
Data Archival (90+ Days)
- DynamoDB Streams trigger archival process through Kinesis
- Data moves to S3 with lifecycle policies:
- S3 Standard (90 days - 6 months)
- S3 IA (6 months - 18 months)
- Glacier (18 months - 7 years)
If you're interested in the technical details of this archival process, check out AWS's guide here.
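For illustration, here's roughly what those lifecycle tiers could look like when expressed through boto3. The bucket name, prefix, and exact day offsets are assumptions on my part; the day counts start from the object's creation in S3, which is already ~90 days after the original sale.

```python
import boto3

s3 = boto3.client("s3")

# Illustrative sketch only: bucket name, prefix, and day offsets are assumptions.
# Days count from the object's arrival in S3, i.e. ~90 days after the sale itself.
s3.put_bucket_lifecycle_configuration(
    Bucket="sales-archive-example",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "sales-archive-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": "sales/"},
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},  # ~6 months after the sale
                    {"Days": 455, "StorageClass": "GLACIER"},     # ~18 months after the sale
                ],
                "Expiration": {"Days": 2465},                     # ~7 years after the sale
            }
        ]
    },
)
```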
ML Pipeline: The Brain
Here's where it gets interesting. After evaluating several options, I chose Amazon's DeepAR model on SageMaker for our time-series forecasting. SageMaker was the natural choice given our AWS stack and need for scalability. The DeepAR model handles:
- Rolling window training with 3-year context
- Seasonal patterns (think Black Friday, holidays)
- Multiple time series (different businesses/categories)
We retrain our models weekly to maintain prediction accuracy, with continuous monitoring for performance degradation. I've documented the complete training flow in this sequence diagram.
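To give a feel for what the weekly training job could look like, here's a hedged sketch using the SageMaker Python SDK. The role ARN, S3 paths, instance type, and hyperparameter values are placeholders I chose for illustration, not tuned settings from the case study.

```python
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder role

# DeepAR ships as a built-in SageMaker algorithm image.
image = image_uris.retrieve("forecasting-deepar", session.boto_region_name)

estimator = Estimator(
    image_uri=image,
    role=role,
    instance_count=1,
    instance_type="ml.c5.2xlarge",                        # placeholder instance choice
    output_path="s3://sales-forecast-example/models/",    # placeholder bucket
    sagemaker_session=session,
)

estimator.set_hyperparameters(
    time_freq="D",            # daily sales series
    prediction_length=365,    # enough horizon to roll up weekly/monthly/yearly views
    context_length=365,       # DeepAR's lag features plus 3 years of history cover seasonality
    epochs=100,
    likelihood="negative-binomial",  # suits non-negative, count-like sales quantities
)

# The train/test channels point at the rolling 3-year feature windows built from the archive.
estimator.fit({
    "train": "s3://sales-forecast-example/features/train/",
    "test": "s3://sales-forecast-example/features/test/",
})
```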
Handling New Businesses
Remember our earlier challenge with new businesses? Here's how we solve it:
- Business Size Bucketing: Group similar businesses together
- Aggregate predictions based on bucket patterns
- Fallback to naive forecasting (weekly averages) for very new businesses
For hands-on implementation details, check out this DeepAR tutorial here.
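Here's a minimal sketch of how that fallback logic might look inside the prediction service. The thresholds (3 years for the full model, 4 weeks for the naive path) are assumptions for illustration.

```python
from statistics import mean
from typing import Optional

MIN_HISTORY_DAYS = 3 * 365   # assumed cutoff for routing a business to the DeepAR pipeline
MIN_NAIVE_DAYS = 28          # assumed minimum history for a weekly-average naive forecast

def forecast_next_week(daily_sales: list[float], bucket_forecast: Optional[float]) -> float:
    """Weekly forecast that degrades gracefully as a business's history shrinks."""
    if len(daily_sales) >= MIN_HISTORY_DAYS:
        raise ValueError("Enough history - route this business to the DeepAR pipeline instead.")
    if len(daily_sales) >= MIN_NAIVE_DAYS:
        # Naive forecast: average of the last four weekly totals.
        last_28 = daily_sales[-28:]
        weekly_totals = [sum(last_28[i:i + 7]) for i in range(0, 28, 7)]
        return mean(weekly_totals)
    # Very new business: fall back to the aggregate pattern of its size bucket.
    return bucket_forecast if bucket_forecast is not None else 0.0
```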
Data Flow
To tie it all together, I've created a detailed data flow diagram. Pay special attention to how data moves through different storage tiers and how the ML pipeline interacts with our archival system. This visualization shows the complete journey from initial sale to final prediction.
Security & Compliance Considerations
When you're handling business sales data, security isn't optional. Here's our comprehensive approach:
Authentication & Authorization
- Cognito for user authentication
- IAM roles for service-to-service communication
- RBAC for different user types (businesses, admins, data scientists)
Data Security
- KMS for encryption at rest
- TLS for data in transit
- PII handling compliance for sensitive business data
API Security
- WAF rules for DDoS protection
- Rate limiting to prevent abuse
- API key management for service access
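As one concrete example, rate limiting and API key management could be enforced together through API Gateway usage plans. The sketch below uses boto3 with placeholder IDs, names, and limits; none of these values come from the case study.

```python
import boto3

apigateway = boto3.client("apigateway")

# Placeholder API ID, stage, and limits - illustrative only.
plan = apigateway.create_usage_plan(
    name="forecast-api-standard-tier",
    description="Per-key throttling for business customers",
    throttle={"rateLimit": 100.0, "burstLimit": 200},  # steady-state and burst requests/sec per key
    quota={"limit": 1_000_000, "period": "MONTH"},     # monthly request cap per API key
    apiStages=[{"apiId": "abc123example", "stage": "prod"}],
)

# Each business gets its own API key attached to the plan, so abuse stays contained per tenant.
key = apigateway.create_api_key(name="business-42-key", enabled=True)
apigateway.create_usage_plan_key(
    usagePlanId=plan["id"],
    keyId=key["id"],
    keyType="API_KEY",
)
```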
The Business Reality Check
Here's where we need to take off our engineering hat and think business. I've crunched the numbers, and here's what we're looking at:
Infrastructure Costs
- Estimated monthly cost: ~155K USD
- Full breakdown available here
- Includes replication costs and buffer for services like Route53
Cost Per User
- Breaking even would require ~$32.25/user/month (typical SaaS analytics tools range from $20-100/month)
- And that's before adding profit margins or considering other business costs
- Pretty steep for what might be just one feature in a larger product
Cost Optimization Strategies
I've identified several ways to optimize:
Technical Optimizations:
- Use spot/reserved instances
- Refine lifecycle policies
- Reconsider hot storage duration
- Optimize batch job timing
- Implement async predictions
Business Strategies:
- Tiered pricing (AI vs non-AI predictions)
- Consider using LLMs for simpler predictions (research link)
- Bundle with other high-value, lower-cost features
- Selective data processing based on business categories
Future Cost Projections
With our 7% YoY growth rate:
- Storage costs will increase significantly
- ML training costs will grow with data volume
- Consider reserved capacity planning
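As a quick sanity check on that growth, here's the compound-growth arithmetic, using the ~365 TB/year figure from the earlier napkin math as the starting point (that baseline itself rests on my earlier record-size assumption).

```python
# Rough projection of yearly data ingestion under 7% YoY growth.
base_tb_per_year = 365   # starting volume, from the earlier back-of-the-envelope math
growth = 1.07

for year in range(1, 8):
    volume = base_tb_per_year * growth ** year
    print(f"Year {year}: ~{volume:,.0f} TB ingested")
# After 7 years the yearly volume is roughly 1.6x today's, and cumulative storage grows faster still.
```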
Monitoring & Deployment
Let's talk about keeping this system running smoothly in production.
Monitoring Strategy
We monitor three key areas:
- Infrastructure metrics (CPU, memory, network)
- Business metrics (sales volume, prediction accuracy)
- User experience metrics (API latency, error rates)
Key monitoring points:
- DynamoDB consumed capacity
- ML model drift detection
- Cross-region replication lag
- API gateway 4xx/5xx rates
We use CloudWatch alarms with SNS topics for immediate alerts on critical issues.
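To make that concrete, here's a minimal sketch of one such alarm on DynamoDB consumed write capacity. The alarm name, table name, threshold, and SNS topic ARN are all placeholders.

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Placeholder names, ARN, and threshold - illustrative only.
cloudwatch.put_metric_alarm(
    AlarmName="sales-records-write-capacity-high",
    Namespace="AWS/DynamoDB",
    MetricName="ConsumedWriteCapacityUnits",
    Dimensions=[{"Name": "TableName", "Value": "SalesRecords"}],
    Statistic="Sum",
    Period=60,                      # one-minute resolution
    EvaluationPeriods=3,            # require three consecutive breaches before alerting
    Threshold=14_000 * 60 * 0.8,    # 80% of the 14K writes/sec peak budget, summed per minute
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:ops-critical"],
)
```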
Testing & Deployment
We follow a blue-green deployment strategy:
- Deploy to secondary region
- Run smoke tests and performance checks
- Gradually shift traffic using Route53 weighted routing (sketched after this list)
- Monitor error rates during transition
- Roll back if needed using DNS failover
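Here's a rough sketch of the traffic-shifting step above, expressed as a Route53 weighted-record update via boto3. The hosted zone ID, record name, and endpoint DNS names are placeholders.

```python
import boto3

route53 = boto3.client("route53")

def shift_traffic(blue_weight: int, green_weight: int) -> None:
    """Move a share of traffic between the blue and green stacks by adjusting record weights."""
    changes = [
        {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "api.example.com",   # placeholder record
                "Type": "CNAME",
                "SetIdentifier": identifier,
                "Weight": weight,
                "TTL": 60,                   # short TTL so weight changes take effect quickly
                "ResourceRecords": [{"Value": target}],
            },
        }
        for identifier, weight, target in [
            ("blue", blue_weight, "blue.api.example.com"),
            ("green", green_weight, "green.api.example.com"),
        ]
    ]
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000EXAMPLE",      # placeholder hosted zone
        ChangeBatch={"Comment": "blue-green traffic shift", "Changes": changes},
    )

# Example ramp while watching error rates: 90/10, then 50/50, then 0/100.
shift_traffic(90, 10)
```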
For testing, we maintain:
- Integration test suite using production-like data
- Load tests simulating peak traffic (14K writes/sec)
- Chaos testing for failover scenarios
- A/B testing framework for ML model updates
This approach lets us:
- Deploy without downtime
- Test thoroughly in production-like conditions
- Fail fast and recover quickly
- Validate ML model changes safely
System Limitations and Future Improvements
Let's be honest about our system's challenges:
- Cost Structure
  - High infrastructure costs might limit market accessibility
  - Needs careful pricing strategy to remain competitive
- New Business Predictions
  - Initial predictions may be less accurate
  - Requires time to build reliable historical data
  - Current solutions are workable but not perfect
- Future Improvements
  - Explore cheaper prediction alternatives for smaller businesses
  - Investigate industry-specific prediction models
  - Consider hybrid approaches combining statistical and ML methods
Common Questions
Q: What if we need to support more than 25-30 categories per business?
A: The current design can handle it, but prediction accuracy might decrease. Consider hierarchical categories.
Q: How do we handle seasonal businesses?
A: DeepAR model inherently handles seasonality, but we might need longer training windows.
Q: What about international businesses with different peak times?
A: Our multi-region setup handles this. Each region processes its local peak independently.
Wrapping Up: Beyond Just Design
This wasn't just another system design walkthrough. I wanted to show you how I approach designing systems that actually work in the real world. Here are the key principles I follow:
- Start with Why
  - Question the problem before jumping to solutions
  - Understand stakeholder needs deeply
  - Validate the business impact
- Numbers Matter
  - Base decisions on concrete calculations
  - Consider business viability alongside technical feasibility
  - Be ready to optimize when costs don't make sense
- Think Holistically
  - Balance technical excellence with business reality
  - Consider security from the start
  - Plan for edge cases (like new businesses)
What's Next?
I'm constantly evolving my approach to system design, and I'd love to hear your thoughts. Have you tackled similar challenges differently? Found better ways to optimize costs? Drop me a line at [email protected].
Remember: no system is perfect - it's all about making informed trade-offs. The key is understanding why you're making each decision and being ready to adapt as requirements change.
Resources
For your reference, here are all the detailed resources mentioned in this article: