A high-performance PyTorch image classification service with gRPC API, Docker containerization, and Redpanda streaming pipeline.
- PyTorch ResNet34 image classification model
- gRPC API for fast inference requests
- Docker containerization with multi-stage builds
- GPU/CPU auto-detection with fallback support
- TorchScript optimization for faster inference
- Redpanda streaming for real-time video processing
- Performance monitoring with latency and throughput metrics
- File logging with timestamps
- Load testing capabilities
ml-system-engineer-pytorch-docker/
├── inference_service/
├── streaming_simulator/
├── Images/
├── client.py
├── load_test.py
├── Dockerfile
├── requirements.txt
└── README.md
docker build -t ml-inference-server .
Terminal 1 - Start Redpanda:
docker run -p 9092:9092 redpandadata/redpanda:latest redpanda start --smp 1
Terminal 2 - Start ML Server:
docker run --gpus all -p 50052:50052 ml-inference-server
Simple test:
python client.py Images/cat.jpg
Load test:
python load_test.py Images/cat.jpg --requests 100 --concurrency 20
pip install -r streaming_simulator/requirements.txt
python streaming_simulator/create_topic.py
# Run for 60 seconds
python streaming_simulator/consumer.py --duration 60
# Run endlessly
python streaming_simulator/consumer.py
python streaming_simulator/producer.py path/to/video.mp4
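For orientation, here is a minimal sketch of what a frame producer along these lines can look like. It assumes kafka-python, JPEG-encoded frames, and a hypothetical topic name `video-frames`; check `producer.py` and `create_topic.py` for the actual topic name and encoding.

```python
# Minimal producer sketch: read a video, publish JPEG frames to Redpanda.
# Assumes kafka-python and a hypothetical "video-frames" topic.
import sys
import cv2
from kafka import KafkaProducer

producer = KafkaProducer(bootstrap_servers="localhost:9092")

cap = cv2.VideoCapture(sys.argv[1])          # path to the video file
while True:
    ok, frame = cap.read()
    if not ok:
        break                                # end of video
    ok, buf = cv2.imencode(".jpg", frame)    # encode the frame as JPEG bytes
    if ok:
        producer.send("video-frames", buf.tobytes())

producer.flush()                             # make sure all frames are delivered
cap.release()
```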
The system tracks and logs:
- Inference latency per frame (milliseconds)
- Throughput (frames per second)
- Success rate for requests
- GPU/CPU utilization
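A rough sketch of how per-frame latency and throughput can be tracked is shown below; `run_inference` and the dummy frame list are placeholders for the real gRPC call and frame source, not the service's actual code.

```python
# Hedged sketch of per-frame latency / throughput bookkeeping.
# run_inference is a placeholder for the real gRPC round trip.
import time

def run_inference(frame_bytes):
    time.sleep(0.01)          # stand-in for the gRPC round trip
    return 0                  # stand-in for a predicted class id

def measure(frames):
    latencies = []
    start = time.perf_counter()
    for frame in frames:
        t0 = time.perf_counter()
        run_inference(frame)
        latencies.append((time.perf_counter() - t0) * 1000)   # milliseconds
    elapsed = time.perf_counter() - start
    fps = len(latencies) / elapsed if elapsed else 0.0
    avg = sum(latencies) / len(latencies) if latencies else 0.0
    print(f"frames={len(latencies)} fps={fps:.1f} avg_latency={avg:.1f} ms")

measure([b"fake-frame"] * 10)   # dummy frames, just to exercise the bookkeeping
```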
# Default: endless
python streaming_simulator/consumer.py
# 30 seconds
python streaming_simulator/consumer.py --duration 30
# 120 seconds
python streaming_simulator/consumer.py --duration 120
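The following is a hedged sketch of how a `--duration` flag can bound the consume loop, assuming kafka-python and the hypothetical `video-frames` topic; the real `consumer.py` may be structured differently.

```python
# Sketch of a duration-bounded consume loop (kafka-python assumed).
import argparse
import time
from kafka import KafkaConsumer

parser = argparse.ArgumentParser()
parser.add_argument("--duration", type=int, default=None,
                    help="seconds to run; omit to run endlessly")
args = parser.parse_args()

consumer = KafkaConsumer(
    "video-frames",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",    # read frames produced before startup
    consumer_timeout_ms=1000,        # wake up regularly to check the deadline
)

deadline = time.time() + args.duration if args.duration else None
frames = 0
while deadline is None or time.time() < deadline:
    for msg in consumer:             # msg.value holds the encoded frame bytes
        frames += 1                  # the real consumer would run inference here
        if deadline and time.time() >= deadline:
            break
print(f"consumed {frames} frames")
```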
# Default: 50 requests, 10 concurrent
python load_test.py Images/cat.jpg
# Custom load
python load_test.py Images/cat.jpg --requests 200 --concurrency 50
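The sketch below shows one way such a load test can be structured: a fixed number of gRPC requests issued through a thread pool over a single channel. The generated stub module and service class names (`inference_pb2`, `inference_pb2_grpc`, `InferenceServerStub`) are assumptions; confirm them against the code generated from `inference.proto`.

```python
# Hedged load-test sketch: N concurrent gRPC requests over one channel.
import sys
import time
from concurrent.futures import ThreadPoolExecutor

import grpc
import inference_pb2          # assumed name of the generated message module
import inference_pb2_grpc     # assumed name of the generated stub module

def one_request(stub, image_bytes):
    t0 = time.perf_counter()
    stub.inference(inference_pb2.InferenceRequest(image=[image_bytes]))
    return (time.perf_counter() - t0) * 1000   # latency in ms

def main(path, requests=50, concurrency=10):
    image_bytes = open(path, "rb").read()
    with grpc.insecure_channel("localhost:50052") as channel:
        stub = inference_pb2_grpc.InferenceServerStub(channel)  # assumed class name
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            latencies = list(pool.map(lambda _: one_request(stub, image_bytes),
                                      range(requests)))
    print(f"avg latency: {sum(latencies) / len(latencies):.1f} ms")

if __name__ == "__main__":
    main(sys.argv[1])
```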
- Interval: 60 seconds
- Timeout: 10 seconds
- Retries: 3
The Docker image includes CUDA-enabled PyTorch. Run with `--gpus all` to enable GPU acceleration.
- File: `logs/inference.log` (inside container)
- Console: Real-time output
- Format: `YYYY-MM-DD HH:MM:SS,mmm - LEVEL - MESSAGE`
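A logging setup along the following lines produces that format; this is a sketch, not necessarily the exact configuration used by the service.

```python
# Sketch of a logging setup matching the stated format:
# timestamped entries in logs/inference.log plus real-time console output.
import logging
import os

os.makedirs("logs", exist_ok=True)
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    handlers=[
        logging.FileHandler("logs/inference.log"),  # file inside the container
        logging.StreamHandler(),                    # console output
    ],
)
logging.info("inference server started")
# -> 2024-01-01 12:00:00,123 - INFO - inference server started
```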
Option 1: Mount logs directory to host
# Run container with volume mount
docker run --gpus all -p 50052:50052 -v "$(pwd)/logs:/app/logs" ml-inference-server
# View logs on host (in new terminal)
tail -f logs/inference.log
Option 2: Access logs inside container
# Get container ID
docker ps
# Access container shell
docker exec -it <container_id> bash
# View logs inside container
tail -f logs/inference.log
Option 3: Copy logs from container
# Copy log file to host
docker cp <container_id>:/app/logs/inference.log ./inference.log
Option 4: Use Docker logs (console output only)
docker logs -f <container_id>
- Console: Performance metrics with timestamps
- Metrics: Frame count, FPS, latency, predictions
Endpoint: localhost:50052
Method: inference
Request:
message InferenceRequest {
    repeated bytes image = 1;
}
Response:
message InferenceReply {
    repeated int32 pred = 1;
}
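A minimal Python call against this endpoint might look as follows; the stub module and service class names are assumptions derived from `inference.proto`, so verify them against the generated code.

```python
# Hedged single-request example against the gRPC endpoint.
import grpc
import inference_pb2          # assumed generated module name
import inference_pb2_grpc     # assumed generated module name

with open("Images/cat.jpg", "rb") as f:
    image_bytes = f.read()

with grpc.insecure_channel("localhost:50052") as channel:
    stub = inference_pb2_grpc.InferenceServerStub(channel)  # assumed class name
    reply = stub.inference(inference_pb2.InferenceRequest(image=[image_bytes]))
    print("predicted class ids:", list(reply.pred))
```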
- Python 3.12+
- PyTorch (with CUDA support)
- gRPC
- OpenCV
- Pillow
- kafka-python
- Redpanda (Docker)
1. Port already in use
docker ps
docker stop <container_id>
2. GPU not detected
- Ensure NVIDIA Docker runtime is installed
- Run with the `--gpus all` flag
3. Redpanda connection failed
- Check if Redpanda is running on port 9092
- Verify topic exists:
python streaming_simulator/create_topic.py
4. Consumer not processing frames
- Check if producer sent frames first
- Verify the consumer offset: set `auto_offset_reset='earliest'` so frames produced before the consumer started are still read
- Use GPU: run with `--gpus all` for a 10x speedup
- TorchScript: already enabled for faster inference
- Batch processing: send multiple images in a single request (see the sketch after this list)
- Concurrent consumers: run multiple consumer instances
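Because `InferenceRequest.image` is a repeated field, batching amounts to packing several encoded images into one request. A hedged sketch, reusing the assumed stub names from the API example above:

```python
# Batch several images into a single gRPC request (stub names assumed).
import glob
import grpc
import inference_pb2
import inference_pb2_grpc

images = [open(p, "rb").read() for p in glob.glob("Images/*.jpg")]

with grpc.insecure_channel("localhost:50052") as channel:
    stub = inference_pb2_grpc.InferenceServerStub(channel)
    # One request carrying the whole batch; reply.pred has one id per image.
    reply = stub.inference(inference_pb2.InferenceRequest(image=images))
    print(list(reply.pred))
```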
cd inference_service
python -m grpc_tools.protoc --python_out=. --grpc_python_out=. inference.proto
- Update `inference.py` with the new model (see the sketch below)
- Rebuild the Docker image
- Test with `load_test.py`
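As a sketch of the model-update step, swapping in a different torchvision model and re-scripting it with TorchScript might look like this. It assumes a recent torchvision with the weights enum API; the actual structure of `inference.py` may differ.

```python
# Sketch: replace the served model and regenerate the TorchScript module.
import torch
from torchvision.models import resnet50, ResNet50_Weights

device = "cuda" if torch.cuda.is_available() else "cpu"   # GPU/CPU auto-detection

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval().to(device)       # the "new" model

# TorchScript compilation for faster inference, as the service already does
scripted = torch.jit.script(model)
scripted.save("resnet50_scripted.pt")
```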