How do I track Errfile to see if something went wrong?

Note: This tutorial shows how to use observer events to automatically monitor SWASH simulation error files and get notified when problems occur.

For a comprehensive guide on observer events with more advanced features and examples, see the full Observer Events tutorial.

The Problem

When running SWASH simulations, errors can occur that might not immediately stop the simulation but indicate problems that need attention. SWASH writes error information to files like Errfile-001, Errfile-002, etc., where the number corresponds to the vCPU that encountered the error. Manually checking these files is tedious, especially for long-running simulations.

The Solution

Use Inductiva's observer events to automatically monitor your SWASH simulation's error files and get email notifications when severe errors are detected.

Quick Setup

Here's how to set up automatic error monitoring for your SWASH simulation:

from inductiva import events

# Register an observer to monitor any Errfile for severe errors
events.register(
    trigger=events.triggers.ObserverFileRegex(
        task_id=task.id,
        file_path="Errfile-*",  # Wildcard matches Errfile-001, Errfile-002, etc.
        regex=r"Severe error (.+)"),  # Captures text after "Severe error "
    action=events.actions.EmailNotification(
        email_address="your@email.com")
)

How It Works

  1. ObserverFileRegex monitors any file matching the pattern Errfile-* in your task's working directory (e.g., Errfile-001, Errfile-002, etc.)
  2. Wildcard matching uses Linux-style * wildcards to match multiple error files, ensuring you catch errors from any vCPU that encounters problems
  3. Regular expression r"Severe error (.+)" detects lines containing "Severe error " and captures the error message that follows
  4. Email notification is sent immediately when a match is found, including the captured error message
  5. The observer runs in the background, so you don't need to actively monitor the simulation
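To make the steps above concrete, here is a minimal local sketch of the same idea: find files by wildcard, scan each line with the regex, and collect the captured text. It is not Inductiva's implementation. The helper name scan_errfiles and the sample file contents are hypothetical, and real SWASH error lines may be formatted differently.

```python
import re
import tempfile
from pathlib import Path


def scan_errfiles(directory, file_pattern="Errfile-*", regex=r"Severe error (.+)"):
    """Local stand-in for the observer: return (filename, captured text) pairs."""
    compiled = re.compile(regex)
    hits = []
    for path in sorted(Path(directory).glob(file_pattern)):
        for line in path.read_text().splitlines():
            match = compiled.search(line)
            if match:
                # Use the first capture group if the regex defines one,
                # otherwise fall back to the whole matched text.
                text = match.group(1) if compiled.groups else match.group(0)
                hits.append((path.name, text))
    return hits


# Hypothetical file contents, for illustration only.
with tempfile.TemporaryDirectory() as workdir:
    Path(workdir, "Errfile-001").write_text("Severe error example message A\n")
    Path(workdir, "Errfile-002").write_text("Warning example message B\n")
    Path(workdir, "other-output.txt").write_text("Severe error not in an Errfile\n")
    hits = scan_errfiles(workdir)

print(hits)
```

Only the line in Errfile-001 is reported: Errfile-002 has no matching line, and other-output.txt does not match the wildcard. A helper like this can also be useful for checking downloaded Errfiles after a task has finished.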

Wildcard File Matching

The Errfile-* pattern uses Linux-style wildcard matching to monitor multiple error files:

  • Errfile-* matches any file starting with "Errfile-" followed by any characters
  • This includes Errfile-001, Errfile-002, Errfile-003, etc.
  • The number in the filename corresponds to the vCPU number that encountered the error
  • SWASH creates separate error files for each vCPU to avoid conflicts during parallel execution
  • Using wildcards ensures you don't miss errors from any vCPU, regardless of which one encounters the problem
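You can preview which filenames a wildcard would select before registering the observer. The sketch below uses Python's standard fnmatch module as a stand-in, on the assumption that the observer's matching behaves like ordinary shell-style globbing. The candidate filenames are made up for illustration.

```python
from fnmatch import fnmatchcase

pattern = "Errfile-*"

# Hypothetical filenames that might appear in a task's working directory.
candidates = [
    "Errfile-001",
    "Errfile-002",
    "Errfile-016",
    "Errfile",        # no hyphen, so the pattern does not apply
    "errfile-001",    # different case; fnmatchcase is case-sensitive
    "output-001.txt",
]

matched = [name for name in candidates if fnmatchcase(name, pattern)]
print(matched)
```

Only the three names beginning with exactly "Errfile-" are selected.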

Customizing the Error Detection

You can modify the regex pattern to detect different types of errors:

# Detect any error (case insensitive)
regex=r"(?i)error"

# Detect specific error types
regex=r"(?i)(fatal|critical|severe) error"

# Detect errors with specific patterns
regex=r"Error \d+: .*"
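Before registering an observer, it can help to try a pattern locally against a few representative lines. The sketch below runs the three patterns above through Python's re module. The sample lines are hypothetical and only illustrate how the patterns differ; check them against the text your own Errfiles actually contain.

```python
import re

# The three patterns from this section, keyed by a short description.
patterns = {
    "any error": r"(?i)error",
    "severity keywords": r"(?i)(fatal|critical|severe) error",
    "numbered errors": r"Error \d+: .*",
}

# Hypothetical log lines, for illustration only.
sample_lines = [
    "Severe error example message",
    "FATAL ERROR example message",
    "Error 42: example message",
    "Warning example message",
]

# For each pattern, list the sample lines it would flag.
results = {
    label: [line for line in sample_lines if re.search(rx, line)]
    for label, rx in patterns.items()
}

for label, lines in results.items():
    print(f"{label}: {len(lines)} matching line(s)")
```

The broad pattern flags every line that mentions an error in any letter case, while the numbered pattern is case-sensitive and flags only the line with an error number.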

What You'll Receive

When a severe error is detected, you'll receive an email that includes the captured error message.

Complete Example

Here's a complete example of running a SWASH simulation with error monitoring:

import inductiva
from inductiva import events

swash = inductiva.simulators.SWASH()

task = swash.run(...)

# Set up error monitoring after creating the task
events.register(
    trigger=events.triggers.ObserverFileRegex(
        task_id=task.id,
        file_path="Errfile-*",  # Wildcard matches any Errfile
        regex=r"Severe error (.+)"),
    action=events.actions.EmailNotification(
        email_address="your@email.com")
)

# The observer will automatically monitor for errors during execution

Benefits

  • Automatic monitoring: No need to manually check error files
  • Immediate alerts: Get notified as soon as problems occur
  • Background operation: Monitoring happens without affecting simulation performance
  • Flexible detection: Customize regex patterns for different error types

This approach ensures you're immediately aware of any issues with your SWASH simulation, allowing you to take corrective action without delay.