# Network Error Detection
## Overview
This implementation adds network error detection to the Banatie API service. Connectivity is **only checked after an error occurs**, so successful requests incur zero overhead, and users get clear, actionable error messages instead of raw exceptions.
## Features
### ✅ Lazy Validation
- Network checks **only trigger on failures**
- Zero performance overhead on successful requests
- Follows the fail-fast pattern
### ✅ Error Classification
Automatically detects and classifies:
- **DNS Resolution Failures** (`ENOTFOUND`, `EAI_AGAIN`)
- **Connection Timeouts** (`ETIMEDOUT`)
- **Connection Refused** (`ECONNREFUSED`)
- **Connection Resets / Unreachable Network** (`ECONNRESET`, `ENETUNREACH`)
- **Generic Network Errors** (messages such as `fetch failed` that carry no recognized error code)
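The codes above are not always on the top-level error. With Node's built-in `fetch` (undici), a failed request typically surfaces as `TypeError: fetch failed`, and the underlying system error (with its `code`) is attached as `error.cause`. A minimal sketch of pulling the code out, assuming the detector walks the `cause` chain; the helper name `extractErrorCode` is hypothetical and not part of the documented API:

```typescript
// Hypothetical helper (not the actual NetworkErrorDetector API): returns the
// first string `code` found on the error or anywhere in its `cause` chain.
export function extractErrorCode(error: unknown, maxDepth = 5): string | undefined {
  let current: unknown = error;
  // maxDepth guards against pathological or cyclic `cause` chains.
  for (let depth = 0; depth < maxDepth; depth++) {
    if (typeof current !== "object" || current === null) {
      return undefined;
    }
    const { code, cause } = current as { code?: unknown; cause?: unknown };
    if (typeof code === "string") {
      return code; // e.g. "ENOTFOUND" from getaddrinfo
    }
    current = cause; // fetch's TypeError stores the system error here
  }
  return undefined;
}

// Usage: simulate what Node's fetch throws on a DNS failure.
const systemError = Object.assign(new Error("getaddrinfo ENOTFOUND example.invalid"), {
  code: "ENOTFOUND",
});
const fetchError = Object.assign(new TypeError("fetch failed"), { cause: systemError });
console.log(extractErrorCode(fetchError)); // prints ENOTFOUND
```

Checking only the top-level error would miss every `fetch`-based failure, which is why the sketch follows `cause`.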
### ✅ Clear User Messages
**Before:**
```
Gemini AI generation failed: exception TypeError: fetch failed sending request
```
**After:**
```
Network error: Unable to connect to Gemini API. Please check your internet connection and firewall settings.
```
### ✅ Detailed Logging
Logs contain both user-friendly messages and technical details:
```
[NETWORK ERROR - DNS] DNS resolution failed: Unable to resolve Gemini API hostname | Technical: Error code: ENOTFOUND
```
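The log line has a fixed shape: a `[NETWORK ERROR - <TYPE>]` tag, the user-facing message, then the raw details after `| Technical:`. A sketch of a `formatErrorForLogging` that reproduces that shape; the `NetworkErrorResult` field names are assumptions, not the actual types in `NetworkErrorDetector.ts`:

```typescript
// Assumed result shape; the real field names may differ.
export type NetworkErrorType = "dns" | "timeout" | "refused" | "reset" | "connection" | "unknown";

export interface NetworkErrorResult {
  isNetworkError: boolean;
  errorType?: NetworkErrorType;
  userMessage: string;
  technicalDetails: string;
}

export function formatErrorForLogging(result: NetworkErrorResult): string {
  if (!result.isNetworkError || result.errorType === undefined) {
    // Non-network errors are logged unchanged.
    return result.technicalDetails;
  }
  const tag = `[NETWORK ERROR - ${result.errorType.toUpperCase()}]`;
  return `${tag} ${result.userMessage} | Technical: ${result.technicalDetails}`;
}

// Reproduces the DNS log line shown above.
console.log(
  formatErrorForLogging({
    isNetworkError: true,
    errorType: "dns",
    userMessage: "DNS resolution failed: Unable to resolve Gemini API hostname",
    technicalDetails: "Error code: ENOTFOUND",
  }),
);
```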
## Implementation
### Core Utility
**Location:** `src/utils/NetworkErrorDetector.ts`
**Key Methods:**
- `classifyError(error, serviceName)` - Analyzes an error and determines if it's network-related
- `formatErrorForLogging(result)` - Formats errors for logging with technical details
### Integration Points
1. **ImageGenService** (`src/services/ImageGenService.ts`)
   - Enhanced error handling in `generateImageWithAI()` method
   - Provides clear network diagnostics when image generation fails
2. **PromptEnhancementService** (`src/services/promptEnhancement/PromptEnhancementService.ts`)
   - Enhanced error handling in `enhancePrompt()` method
   - Helps users diagnose connectivity issues during prompt enhancement
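A plausible shape for this integration is: run the Gemini call, and only in the `catch` branch hand the error to the classifier. The sketch below packages that as a reusable wrapper, assuming a classifier that returns `isNetworkError` and `userMessage` and a `{ success, error }` result like the one in the service logs. `runWithNetworkDiagnostics`, `Classifier`, and `Outcome` are hypothetical names; the real services may inline this logic instead:

```typescript
// Assumed classifier result (stands in for NetworkErrorDetector.classifyError,
// which may be sync or async).
interface Classification {
  isNetworkError: boolean;
  userMessage: string;
}
type Classifier = (error: unknown, serviceName: string) => Classification | Promise<Classification>;

// Mirrors the `{ success, error }` objects seen in the service logs.
type Outcome<T> = { success: true; data: T } | { success: false; error: string };

// Hypothetical wrapper: the classifier runs only in the catch branch,
// so a successful call does no extra work.
export async function runWithNetworkDiagnostics<T>(
  serviceName: string,
  operation: () => Promise<T>,
  classify: Classifier,
): Promise<Outcome<T>> {
  try {
    return { success: true, data: await operation() };
  } catch (error) {
    const result = await classify(error, serviceName);
    if (result.isNetworkError) {
      return { success: false, error: result.userMessage };
    }
    // Non-network errors keep their original message.
    const message = error instanceof Error ? error.message : String(error);
    return { success: false, error: message };
  }
}
```

Inside `generateImageWithAI()`, a call might look like `runWithNetworkDiagnostics("Gemini API", () => callGemini(prompt), classifier)`, where `callGemini` and `classifier` are placeholders for whatever the service actually uses.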
## How It Works
### Normal Operation (No Overhead)
```
User Request → Service → Gemini API → Success ✓
```
No network checks performed - zero overhead.
### Error Scenario (Smart Detection)
```
User Request → Service → Gemini API → Error ✗
                                          ↓
                     NetworkErrorDetector.classifyError()
                       1. Check error code/message for network patterns
                       2. If network-related: Quick DNS check (2s timeout)
                       3. Return classification + user-friendly message
```
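Step 2 is the only point where the detector itself touches the network. A minimal sketch of a DNS check bounded by a 2s timeout, racing `lookup` from `node:dns/promises` against a timer. The name `quickDnsCheck` and the injectable `lookupFn` parameter (added so the check can be exercised without network access) are assumptions:

```typescript
import { lookup } from "node:dns/promises";

// Any function that resolves on success and rejects on failure.
type LookupFn = (hostname: string) => Promise<unknown>;

/**
 * Hypothetical quick DNS check: resolves true if `hostname` resolves within
 * `timeoutMs`, false if the lookup fails or the timer wins the race.
 * Intended to run only after a request has already failed.
 */
export async function quickDnsCheck(
  hostname: string,
  timeoutMs = 2000,
  lookupFn: LookupFn = (host) => lookup(host),
): Promise<boolean> {
  // Map success/failure of the lookup to a boolean; never rejects.
  const attempt = (async () => {
    try {
      await lookupFn(hostname);
      return true;
    } catch {
      return false;
    }
  })();

  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<boolean>((resolve) => {
    timer = setTimeout(() => resolve(false), timeoutMs);
  });

  try {
    return await Promise.race([attempt, timeout]);
  } finally {
    clearTimeout(timer); // don't keep the process alive after a fast answer
  }
}
```

For Gemini this would be called as `await quickDnsCheck("generativelanguage.googleapis.com")`, assuming that is the host the service targets. `dns.lookup` uses the OS resolver (`getaddrinfo`), the same path Node's HTTP stack uses by default. The timeout only stops waiting; it does not cancel the underlying lookup.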
## Error Types
| Error Type | Trigger | User Message |
|-----------|---------|--------------|
| `dns` | ENOTFOUND, EAI_AGAIN | DNS resolution failed: Unable to resolve Gemini API hostname |
| `timeout` | ETIMEDOUT | Connection timeout: Gemini API did not respond in time |
| `refused` | ECONNREFUSED | Connection refused: Service may be down or blocked by firewall |
| `reset` | ECONNRESET, ENETUNREACH | Connection lost: Network connection was interrupted |
| `connection` | General connectivity failure | Network connection failed: Unable to reach Gemini API |
| `unknown` | Network keywords detected | Network error: Unable to connect to Gemini API |
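The table maps directly onto a lookup. Error codes decide the type first, and a keyword scan of the message is the fallback that produces `unknown`. A sketch under stated assumptions: the keyword list and the name `classifyNetworkError` are illustrative, and the `connection` row is left unmapped because this document does not say which condition triggers it:

```typescript
type NetworkErrorType = "dns" | "timeout" | "refused" | "reset" | "connection" | "unknown";

// Error code → type, one entry per code listed in the table.
const CODE_TO_TYPE = new Map<string, NetworkErrorType>([
  ["ENOTFOUND", "dns"],
  ["EAI_AGAIN", "dns"],
  ["ETIMEDOUT", "timeout"],
  ["ECONNREFUSED", "refused"],
  ["ECONNRESET", "reset"],
  ["ENETUNREACH", "reset"],
]);

// User messages copied from the table; `service` fills in e.g. "Gemini API".
const USER_MESSAGES: Record<NetworkErrorType, (service: string) => string> = {
  dns: (service) => `DNS resolution failed: Unable to resolve ${service} hostname`,
  timeout: (service) => `Connection timeout: ${service} did not respond in time`,
  refused: () => "Connection refused: Service may be down or blocked by firewall",
  reset: () => "Connection lost: Network connection was interrupted",
  // Trigger for `connection` is unspecified here, so this sketch never returns it.
  connection: (service) => `Network connection failed: Unable to reach ${service}`,
  unknown: (service) => `Network error: Unable to connect to ${service}`,
};

// Hypothetical keyword list for the code-less fallback.
const NETWORK_KEYWORDS = ["fetch failed", "socket hang up", "network error"];

export function classifyNetworkError(
  code: string | undefined,
  message: string,
  service: string,
): { type: NetworkErrorType; userMessage: string } | null {
  const byCode = code === undefined ? undefined : CODE_TO_TYPE.get(code);
  if (byCode !== undefined) {
    return { type: byCode, userMessage: USER_MESSAGES[byCode](service) };
  }
  const lower = message.toLowerCase();
  if (NETWORK_KEYWORDS.some((keyword) => lower.includes(keyword))) {
    return { type: "unknown", userMessage: USER_MESSAGES.unknown(service) };
  }
  return null; // not network-related: the original error is passed through
}
```

With no code and the message `exception TypeError: fetch failed sending request`, this falls through to `unknown`, matching the `[NETWORK ERROR - UNKNOWN]` log line in this document. Returning `null` for anything else is what keeps ordinary API errors (bad keys, invalid prompts) from being reported as network problems.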
## Testing
Run the manual test to see error detection in action:
```bash
npx tsx test-network-error-detector.ts
```
This demonstrates:
- DNS errors
- Timeout errors
- Fetch failures (the real error captured in the service logs)
- Non-network errors (no false positives)
## Example Error Logs
### Before Implementation
```
[2025-10-09T16:56:29.228Z] [fmfnz0zp7] Text-to-image generation completed: {
  success: false,
  error: 'Gemini AI generation failed: exception TypeError: fetch failed sending request'
}
```
### After Implementation
```
[ImageGenService] [NETWORK ERROR - UNKNOWN] Network error: Unable to connect to Gemini API. Please check your internet connection and firewall settings. | Technical: exception TypeError: fetch failed sending request
```
## Benefits
1. **Better UX**: Users get actionable error messages
2. **Faster Debugging**: Developers immediately know if it's a network issue
3. **Zero Overhead**: No performance impact on successful requests
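4. **Production-Ready**: Follows error-handling practices common in widely used SDKs (e.g. AWS SDK, Stripe, Google Cloud)
5. **Comprehensive**: Covers the common Node.js network error codes (DNS, timeout, refused, reset/unreachable), with keyword matching as a fallback for generic failures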
## Future Enhancements
Potential improvements:
- Retry logic with exponential backoff for transient network errors
- Circuit breaker pattern for repeated failures
- Metrics/alerting for network error rates
- Health check endpoint with connectivity status
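To illustrate the first item, a retry loop can reuse the error code so that only transient failures are retried. A minimal sketch, assuming timeouts, resets, unreachable networks, and temporary DNS failures (`EAI_AGAIN`) count as transient, while `ENOTFOUND` and non-network errors fail immediately. `retryWithBackoff` and its defaults are hypothetical:

```typescript
// Assumed set of transient codes; tune to taste.
const TRANSIENT_CODES = new Set(["ETIMEDOUT", "ECONNRESET", "ENETUNREACH", "EAI_AGAIN"]);

// Checks only the top-level `code`; a fetch-based caller would first unwrap
// `error.cause`, since fetch hides the system error there.
function isTransient(error: unknown): boolean {
  const code = (error as { code?: unknown } | null | undefined)?.code;
  return typeof code === "string" && TRANSIENT_CODES.has(code);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

export async function retryWithBackoff<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  baseDelayMs = 500,
): Promise<T> {
  let attempt = 1;
  while (true) {
    try {
      return await operation();
    } catch (error) {
      // Give up on the last attempt or on anything that is not transient.
      if (attempt >= maxAttempts || !isTransient(error)) throw error;
      // Exponential delay (base, 2x base, 4x base, ...) plus up to 100ms of jitter.
      const delayMs = baseDelayMs * 2 ** (attempt - 1) + Math.random() * 100;
      await sleep(delayMs);
      attempt++;
    }
  }
}
```

The jitter spreads out retries from concurrent requests so they do not all hit the API at the same moment after an outage.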