Exercises
---
Conceptual Questions
- Explain the difference between parameter scaling, data scaling, and compute scaling.
- Why do scaling laws often follow approximate power-law behavior?
- What is compute-optimal training? Why can a smaller model trained on more data outperform a larger model trained on less data?
- Explain the difference between training-time scaling and inference-time scaling.
- Why is attention complexity quadratic in sequence length for standard transformers?
- What are the main bottlenecks in large-scale AI systems?
- Compare FP32, FP16, BF16, and INT8 computation. What tradeoffs exist between precision and efficiency?
- Explain the purpose of:
  - quantization
  - pruning
  - distillation
  - gradient checkpointing
- Why is memory movement often more expensive than arithmetic computation?
- What is operator fusion? Why does it improve performance?
- Explain the sim-to-real gap in robotics.
- Compare imitation learning and reinforcement learning.
- Why are world models important for embodied agents?
- What are affordances? Give three examples from robotics.
- Why is uncertainty estimation critical in scientific deep learning?
- Explain the difference between correlation and causation.
- Why are distribution shifts dangerous in deployed systems?
- What is catastrophic forgetting?
- Why is interpretability difficult in large neural networks?
- Explain why benchmark accuracy alone is insufficient for evaluating advanced AI systems.
Mathematical Exercises
1. Parameter Scaling
Suppose validation loss follows:
$$ L(N) = 1.5N^{-0.08} + 1.2 $$
where $N$ is the parameter count in billions.
- Compute the loss for:
  - $N = 1$
  - $N = 10$
  - $N = 100$
- What happens to the loss as $N \rightarrow \infty$?
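One way to check hand calculations for this exercise is to evaluate the formula numerically. The sketch below is only an illustration. The function name `validation_loss` is an assumption introduced here, not something defined in the chapter.

```python
def validation_loss(n_billion: float) -> float:
    """Evaluate L(N) = 1.5 * N^(-0.08) + 1.2, with N in billions of parameters."""
    return 1.5 * n_billion ** -0.08 + 1.2


# Evaluate the three requested model sizes.
for n in (1, 10, 100):
    print(f"N = {n:>3}B -> loss = {validation_loss(n):.4f}")

# Probe the large-N behavior with a very large (hypothetical) model size.
print(f"N = 1e12B -> loss = {validation_loss(1e12):.4f}")
```

The power-law term shrinks as $N$ grows, so the last line gives a hint about which term dominates in the limit.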
2. Compute Scaling
Suppose compute scales as:
$$ C \propto ND $$
where:
- $N$ = parameters
- $D$ = training tokens
If the compute budget doubles, describe three possible scaling strategies.
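As a starting point for the discussion, the following sketch enumerates three simple ways to spend a larger budget under the proportionality $C \propto ND$. The helper `scale_budget` and the starting values for $N$ and $D$ are hypothetical. Other allocations are possible, and which one is best depends on the scaling law being assumed.

```python
import math


def scale_budget(n_params: float, n_tokens: float, factor: float = 2.0) -> dict:
    """Return three (N, D) pairs whose product N * D grows by `factor`."""
    root = math.sqrt(factor)
    return {
        "grow model only": (n_params * factor, n_tokens),
        "grow data only": (n_params, n_tokens * factor),
        "grow both equally": (n_params * root, n_tokens * root),
    }


# Hypothetical starting point: 1B parameters trained on 20B tokens.
strategies = scale_budget(1e9, 20e9)
for name, (n, d) in strategies.items():
    print(f"{name:>18}: N = {n:.3g}, D = {d:.3g}, N*D = {n * d:.3g}")
```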
3. Attention Complexity
A transformer processes sequences of length $T$.
Standard attention requires:
$$ O(T^2) $$
operations.
- Compare the relative attention cost for:
  - $T = 512$
  - $T = 2048$
  - $T = 8192$
- How much larger is the attention matrix when sequence length increases from 2048 to 8192?
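The comparison can be checked with a few lines of arithmetic. This sketch considers only the $T^2$ term and ignores constant factors, the number of heads, and the hidden dimension. The function name `relative_attention_cost` is an assumption made for illustration.

```python
def relative_attention_cost(seq_len: int, baseline: int = 512) -> float:
    """Ratio of T^2 costs relative to a baseline sequence length."""
    return (seq_len / baseline) ** 2


# Cost of each sequence length relative to T = 512.
for t in (512, 2048, 8192):
    print(f"T = {t:>4}: {relative_attention_cost(t):.0f}x the T = 512 cost")

# Growth of the T x T attention matrix when going from 2048 to 8192 tokens.
print(f"2048 -> 8192: {relative_attention_cost(8192, baseline=2048):.0f}x")
```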
4. Mixed Precision Memory Savings
A model has 8 billion parameters.
- Estimate parameter memory usage in:
  - FP32
  - FP16
- Assume that training stores the following for each parameter, then estimate total memory usage:
  - the parameter itself
  - its gradient
  - the Adam first moment
  - the Adam second moment
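A rough estimate can be scripted as below. The sketch makes several assumptions that the exercise leaves open: 1 GB is taken as $10^9$ bytes, all four training tensors are stored in FP32, and activation memory is ignored. The helper `tensor_memory_gb` is hypothetical.

```python
GB = 1e9  # assumption: decimal gigabytes


def tensor_memory_gb(n_params: float, bytes_per_value: int, copies: int = 1) -> float:
    """Memory for `copies` tensors of `n_params` values each, in GB."""
    return n_params * bytes_per_value * copies / GB


n_params = 8e9
fp32_gb = tensor_memory_gb(n_params, bytes_per_value=4)
fp16_gb = tensor_memory_gb(n_params, bytes_per_value=2)
# Parameters, gradients, and two Adam moments: four FP32 copies (assumed).
training_gb = tensor_memory_gb(n_params, bytes_per_value=4, copies=4)

print(f"FP32 parameters: {fp32_gb:.0f} GB")
print(f"FP16 parameters: {fp16_gb:.0f} GB")
print(f"FP32 training state: {training_gb:.0f} GB")
```

Mixed-precision setups often keep some of these tensors in FP16 or BF16 and others in FP32, so the total changes with the precision chosen for each copy.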
5. Quantization
Suppose a model requires 40 GB in FP16.
Estimate storage size after:
- INT8 quantization
- INT4 quantization
Ignore metadata overhead.
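Because metadata is ignored, storage scales linearly with the number of bits per weight. The sketch below applies that ratio. The function name `quantized_size_gb` is an assumption introduced for illustration.

```python
def quantized_size_gb(size_gb: float, source_bits: int, target_bits: int) -> float:
    """Scale a model size by the ratio of target to source bit width."""
    return size_gb * target_bits / source_bits


fp16_size_gb = 40.0
int8_gb = quantized_size_gb(fp16_size_gb, source_bits=16, target_bits=8)
int4_gb = quantized_size_gb(fp16_size_gb, source_bits=16, target_bits=4)

print(f"INT8: {int8_gb:.0f} GB")
print(f"INT4: {int4_gb:.0f} GB")
```

Real quantized checkpoints are somewhat larger, since they also store scales and zero points.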
PyTorch Exercises
1. Count Parameters
Write a PyTorch function that counts:
- total parameters
- trainable parameters
for any model.
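One possible solution is sketched here. The function name `count_parameters` and the toy model are assumptions. The approach relies on `model.parameters()`, `numel()`, and `requires_grad`.

```python
import torch.nn as nn


def count_parameters(model: nn.Module) -> tuple[int, int]:
    """Return (total, trainable) parameter counts for a model."""
    total = sum(p.numel() for p in model.parameters())
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    return total, trainable


# Toy model for illustration; freeze the first layer to make the counts differ.
model = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 5))
for p in model[0].parameters():
    p.requires_grad = False

total, trainable = count_parameters(model)
print(f"total = {total}, trainable = {trainable}")
```

Parameters shared between modules are yielded once by `model.parameters()`, so tied weights are not double counted.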
2. Mixed Precision Training
Modify a standard training loop to use:
- autocast
- GradScaler
Measure memory usage before and after.
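The sketch below shows one way the two pieces can fit into a training step. The model, data, and hyperparameters are toy placeholders. As an assumption of this sketch, FP16 autocast and loss scaling are enabled only when a CUDA device is available, so the same code still runs in full precision on CPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
use_amp = device.type == "cuda"  # assumption: mixed precision only on GPU
amp_dtype = torch.float16 if use_amp else torch.bfloat16

# Toy model and batch, purely illustrative.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
# Newer PyTorch releases also offer torch.amp.GradScaler("cuda", ...).
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(32, 64, device=device)
y = torch.randint(0, 10, (32,), device=device)


def train_step(x, y):
    optimizer.zero_grad(set_to_none=True)
    # Forward pass runs in reduced precision where autocast allows it.
    with torch.autocast(device_type=device.type, dtype=amp_dtype, enabled=use_amp):
        loss = loss_fn(model(x), y)
    # Scale the loss to reduce FP16 gradient underflow, then step and update.
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
    return loss.item()


if use_amp:
    torch.cuda.reset_peak_memory_stats()
losses = [train_step(x, y) for _ in range(20)]
if use_amp:
    print(f"peak memory: {torch.cuda.max_memory_allocated() / 1e6:.1f} MB")
print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

For the before-and-after comparison, the same loop can be run once with `use_amp` forced to `False` and once with it enabled, recording peak memory each time.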
3. Gradient Checkpointing
Apply gradient checkpointing to a transformer block and compare:
- peak memory usage
- training speed
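A minimal sketch of the mechanics follows, using `torch.utils.checkpoint.checkpoint` on a small pre-norm transformer block. The `Block` class and its sizes are assumptions made for illustration. The check confirms that checkpointing leaves outputs and gradients unchanged, and the memory and speed measurements are left to the exercise.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint


class Block(nn.Module):
    """A small pre-norm transformer block (illustrative sizes, no dropout)."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, x):
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


torch.manual_seed(0)
block = Block()
x = torch.randn(2, 16, 64, requires_grad=True)

# Regular forward pass: intermediate activations are kept for backward.
out_plain = block(x)
# Checkpointed forward pass: activations are recomputed during backward.
out_ckpt = checkpoint(block, x, use_reentrant=False)

(grad_plain,) = torch.autograd.grad(out_plain.sum(), x)
(grad_ckpt,) = torch.autograd.grad(out_ckpt.sum(), x)
print("gradients match:", torch.allclose(grad_plain, grad_ckpt, atol=1e-5))
```

Checkpointing trades extra forward computation for lower activation memory, so the savings are more visible with deeper stacks and longer sequences than in this toy setup.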
4. Profiling
Use torch.profiler to identify:
- slow operations
- memory bottlenecks
- synchronization overhead
in a training script.
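As a starting point, the sketch below profiles a few training steps of a toy model on CPU and prints the most expensive operators. The model and batch are placeholders. On a GPU, `ProfilerActivity.CUDA` would typically be added to the activity list so that kernel times and synchronization points show up as well.

```python
import torch
import torch.nn as nn
from torch.profiler import ProfilerActivity, profile

torch.manual_seed(0)
# Toy model and batch, purely illustrative.
model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()
x = torch.randn(64, 256)
y = torch.randint(0, 10, (64,))

# Record CPU operator timings and memory allocations for three steps.
with profile(
    activities=[ProfilerActivity.CPU], profile_memory=True, record_shapes=True
) as prof:
    for _ in range(3):
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()

# Summarize the slowest operators by total CPU time.
report = prof.key_averages().table(sort_by="cpu_time_total", row_limit=10)
print(report)
```

Sorting by `self_cpu_memory_usage` instead gives a view that is more useful for the memory part of the exercise.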
5. Quantized Inference
Convert a pretrained model to INT8 and measure:
- inference latency
- model size
- accuracy degradation
Research Exercises
- Read a recent scaling-law paper and summarize:
  - the scaling variables
  - the fitted equation
  - the experimental setup
  - the limitations
- Compare two efficient transformer architectures.
- Study a robotics benchmark and identify:
  - sensor inputs
  - action space
  - evaluation metrics
  - failure modes
- Investigate one scientific deep learning system in an area such as:
  - protein folding
  - weather forecasting
  - molecular generation
  - neural operators
- Analyze one open problem in AI safety or alignment and explain why it remains difficult.
Open-Ended Projects
Project 1. Scaling Experiment
Train language models of several sizes and fit a scaling curve:
$$ L(N) \approx aN^{-b} + c $$
Plot:
- parameter count vs validation loss
- compute vs validation loss
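The curve-fitting step can be prototyped before any models are trained. The sketch below is a simplified stand-in for a full nonlinear least-squares fit: it scans candidate exponents $b$ on a grid and, for each one, solves for $a$ and $c$ by ordinary linear regression on $N^{-b}$. The function `fit_power_law`, the grid, and the data are assumptions. The data is synthetic and noise-free, generated from the curve in Mathematical Exercise 1 rather than from real training runs.

```python
def fit_power_law(ns, losses, b_grid):
    """Fit L(N) ~ a * N^(-b) + c by scanning b and solving (a, c) in closed form."""
    best = None
    mean_l = sum(losses) / len(losses)
    for b in b_grid:
        # For a fixed exponent, the model is linear in x = N^(-b).
        xs = [n ** -b for n in ns]
        mean_x = sum(xs) / len(xs)
        var_x = sum((x - mean_x) ** 2 for x in xs)
        cov = sum((x - mean_x) * (l - mean_l) for x, l in zip(xs, losses))
        a = cov / var_x
        c = mean_l - a * mean_x
        sse = sum((a * x + c - l) ** 2 for x, l in zip(xs, losses))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1], best[2], best[3]


# Synthetic data: parameter counts in billions and losses from a known curve.
ns = [0.1, 0.3, 1, 3, 10, 30, 100]
losses = [1.5 * n ** -0.08 + 1.2 for n in ns]
b_grid = [k / 100 for k in range(1, 101)]  # candidate exponents 0.01 ... 1.00

a, b, c = fit_power_law(ns, losses, b_grid)
print(f"a = {a:.3f}, b = {b:.2f}, c = {c:.3f}")
```

With real, noisy measurements the fitted values are sensitive to the range of model sizes covered, so it helps to report the fit alongside the raw points in the plots.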
Project 2. Efficient Inference System
Build a text-generation service with:
- quantized inference
- KV caching
- batching
- latency measurement
Measure throughput and memory usage.
Project 3. Physics-Informed Neural Network
Train a PINN to solve a partial differential equation such as:
- heat equation
- wave equation
- Burgers’ equation
Compare learned and analytical solutions.
Project 4. Robot Policy Learning
Train a robot policy in simulation using:
- imitation learning
- reinforcement learning
Evaluate sim-to-real transfer performance if hardware is available.
Project 5. Long-Context Evaluation
Evaluate transformer performance as context length increases.
Measure:
- memory usage
- inference latency
- retrieval accuracy
- degradation under long sequences
Chapter Summary
This chapter explored future directions in deep learning:
- scaling laws
- efficient AI systems
- scientific deep learning
- robotics and embodied AI
- open research problems
Modern AI research increasingly combines:
- large-scale optimization
- multimodal learning
- reasoning systems
- scientific computation
- interaction with tools and environments
Future progress will depend not only on larger models, but also on deeper understanding, more efficient systems, stronger evaluation methods, and safer deployment practices.