CI/CD Best Practices

Debugging Failed Pipelines

Pipeline failures are one of the most frustrating experiences in software development. When your build fails, you lose time to context switching, deployments get blocked, and the whole team slows down. The average developer loses 5-10 minutes per failure just getting back into flow state.

The good news? Most pipeline failures fall into predictable patterns, and with the right troubleshooting approach, you can reduce your mean time to resolution (MTTR) by 50-70%.

Why systematic debugging matters

Debugging pipelines effectively is a learnable skill that compounds over time:

  1. Faster resolution: Systematic approaches help you identify root causes quickly instead of trying random fixes.
  2. Pattern recognition: Once you’ve seen a failure type, you’ll recognize it immediately next time.
  3. Team independence: You won’t need to wait for DevOps support for common issues.
  4. Confidence: Understanding how to debug gives you confidence to iterate faster.

The debugging process follows a simple cycle: Observe (gather information) → Hypothesize (form a theory) → Test (try a fix) → Verify (confirm it works). When a fix doesn’t work, you cycle back with new information.


Quick Checks - Start Here

Before investigating deeply, these quick checks resolve 60% of pipeline failures in under 5 minutes. Run through this list first:


1. Clear cache and re-run

Corrupted cache is responsible for 15-20% of mysterious failures. If your build is failing with strange errors that don’t make sense, try clearing the cache.

When to suspect: Intermittent failures, works locally, error mentions “integrity” or “checksum”

How to clear:

  • GitHub Actions: Delete the cache via settings or change the cache key (see the CLI sketch after this list)
  • CircleCI: Use “Rerun workflow from failed” with “Rerun job with SSH” then clear cache
  • Jenkins: Clean workspace before build
  • Travis CI: Clear cache via repository settings
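
For GitHub Actions, a rough command-line sketch of the first option (assumes GitHub CLI 2.32+ with the gh cache subcommands; the cache key is a placeholder):

# List caches for the repository, then delete the suspect one
gh cache list
gh cache delete <cache-key-or-id>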

2. Check external service status

Is the problem you, or is it them? npm registry down? GitHub having issues? Docker Hub rate limits?

Status pages to check:

  • GitHub / GitHub Actions: https://www.githubstatus.com
  • npm registry: https://status.npmjs.org
  • Your container registry and your CI provider’s own status page

If a service is degraded, wait for resolution or implement a retry strategy.
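
A scripted spot-check can save a trip to the browser. GitHub and npm both host standard Statuspage endpoints; the exact API paths below are an assumption, so verify them against each provider’s status page:

# Print the current overall status description
curl -s https://www.githubstatus.com/api/v2/status.json | jq -r '.status.description'
curl -s https://status.npmjs.org/api/v2/status.json | jq -r '.status.description'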

3. Verify secrets and credentials

Expired tokens and rotated credentials cause 10-15% of failures. Common signs: 401/403 errors, authentication failures, or “permission denied” messages.

How to check without exposing secrets:

# Check if secret is set (without printing value)
[ -z "$SECRET_NAME" ] && echo "Secret not set" || echo "Secret exists"

# For tokens, check if they're expired (example for JWT)
echo $JWT_TOKEN | cut -d'.' -f2 | base64 -d | jq .exp

Common issues:

  • Token expired or was rotated
  • Secret name typo (check for case sensitivity)
  • Secret not configured for the branch/environment

4. Compare with last successful run

Often a small change broke the build. Use git or CI comparison tools to see what changed between the last successful run and the current failure.

# Compare current commit with last successful build
git diff <last-successful-commit> HEAD

# Check recent commits
git log --oneline -10

Look for:

  • Dependency updates (package.json, requirements.txt, go.mod)
  • Configuration changes (.yml, .json, .env)
  • Environment variable changes
  • CI pipeline configuration edits

5. Check for concurrent builds

Multiple builds running simultaneously can fight over shared resources or cause race conditions.

Common issues:

  • Parallel jobs writing to same file/database
  • Port conflicts (two jobs trying to bind to same port)
  • Shared cache corruption

How to check:

  • Look for other running builds in your CI provider (see the sketch after this list)
  • Check if failure only happens with parallel jobs
  • Review build concurrency settings
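
For GitHub Actions, a quick sketch using the GitHub CLI (assumes gh is installed and authenticated):

# List workflow runs that are currently executing
gh run list --status in_progress --limit 20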

6. Review recent dependency updates

Lockfile changes or “latest” version tags can introduce breaking changes without you realizing it.

# Check what dependency versions changed
git diff HEAD~1 package-lock.json
git diff HEAD~1 poetry.lock
git diff HEAD~1 go.sum

Red flags:

  • Major version bumps (2.x.x → 3.0.0)
  • Using ^ or ~ ranges without lockfile
  • “latest” tags in dependencies
  • Transitive dependency updates

Quick fix: Pin to the last known working version temporarily.
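
A minimal sketch of pinning (package name and version are placeholders):

# npm: save an exact version instead of a range
npm install some-package@1.4.2 --save-exact

# Python: pin exactly in requirements.txt
# some-package==1.4.2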

7. Try locally first

Can you reproduce the failure on your machine? This tells you whether it’s a code issue or an environment issue.

Can reproduce locally? → It’s likely a code/test issue. Fix in your local environment.

Cannot reproduce locally? → It’s likely an environment difference. Check versions, environment variables, OS differences.

Tips for reproducing:

  • Use the same runtime versions (Node, Python, etc.)
  • Run with same environment variables
  • Use Docker to match CI environment exactly
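
A minimal sketch of the Docker approach (the image name is an assumption; match it to whatever your CI config uses):

# Run the install and tests inside the same image your CI uses
docker run --rm -it -v "$PWD":/work -w /work node:20 bash -lc "npm ci && npm test"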

Dependency and Package Manager Issues

Package manager failures account for 25% of CI failures. These often work locally because your machine has cached versions or different lockfile states.

Common Error Patterns:

Pattern: “404 Not Found - Package doesn’t exist”

Example:

npm ERR! 404 Not Found - GET https://registry.npmjs.org/@company/private-pkg
npm ERR! 404  '@company/private-pkg@^1.2.0' is not in the npm registry

What it means: The package manager can’t find the package. Common causes: typo in package name, private package without registry authentication, or package was unpublished.

How to diagnose:

  • Check package name spelling in package.json/requirements.txt/go.mod
  • Verify registry URL is correct for private packages
  • Check if authentication token is set and valid
  • Search the package on the public registry to confirm it exists
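
A quick existence check against the public registry (package name copied from the error above):

npm view @company/private-pkg versions --registry=https://registry.npmjs.org/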

How to fix:

For npm private packages:

# Ensure auth token is set (in CI environment variables)
npm config set //registry.npmjs.org/:_authToken=${NPM_TOKEN}

# Or use .npmrc file
echo "//registry.npmjs.org/:_authToken=\${NPM_TOKEN}" > .npmrc

For Python private packages:

# Configure pip to use private index
pip config set global.extra-index-url https://${PYPI_TOKEN}@pypi.company.com/simple

For Go private packages:

# Configure GOPRIVATE
go env -w GOPRIVATE=github.com/company/*

Pattern: “Integrity checksum failed”

Example:

npm ERR! Integrity check failed for package-lock.json
npm ERR! sha512-abc123... integrity checksum failed

What it means: The downloaded package doesn’t match the expected hash. Usually indicates: corrupted cache, registry issues, or package-lock.json out of sync with package.json.

How to diagnose:

  • Check if package-lock.json was manually edited
  • Verify registry is accessible and returning correct packages
  • Look for merge conflicts in lockfile

How to fix:

For npm:

# Clear npm cache and reinstall
npm cache clean --force
rm -rf node_modules package-lock.json
npm install

For Python:

# Clear pip cache and reinstall from scratch
pip cache purge
pip install --force-reinstall --no-cache-dir -r requirements.txt

For Go:

# Clear module cache
go clean -modcache
rm go.sum
go mod download

Pattern: “Version conflict / Could not resolve dependencies”

Example:

npm ERR! ERESOLVE could not resolve
npm ERR! peer react@"^18.0.0" from [email protected]
npm ERR! but project requires react@"^17.0.0"

What it means: Multiple packages require incompatible versions of the same dependency (peer dependency conflict).

How to diagnose:

  • Read the full error to identify conflicting packages
  • Use dependency tree tools to see the conflict:
    npm ls react        # npm
    pip show package    # pip
    go mod graph        # go
  • Check if recent updates introduced the conflict

How to fix:

For npm (use overrides):

// package.json
{
  "overrides": {
    "react": "^18.0.0"
  }
}

For npm (use legacy peer deps):

npm install --legacy-peer-deps

For Python (use constraints):

# constraints.txt
package-name<2.0.0,>=1.5.0

For Go (use replace directive):

// go.mod
replace github.com/old/package => github.com/new/package v1.2.3

Pattern: “Lockfile is out of date”

Example:

error package-lock.json: outdated lockfile
The package-lock.json doesn't match package.json

What it means: The lockfile doesn’t reflect the current state of package.json. Usually from manual package.json edits or merge conflicts.

How to diagnose:

  • Check git history for package.json changes without lockfile updates
  • Look for merge conflict markers in lockfile

How to fix:

# Regenerate lockfile
npm install           # npm (updates package-lock.json)
pip freeze > requirements.txt  # pip (updates requirements.txt)
go mod tidy          # go (updates go.sum)
poetry lock          # poetry (updates poetry.lock)

# Commit the updated lockfile
git add package-lock.json
git commit -m "Update lockfile"

Flaky and Timing-Dependent Tests

Flaky tests erode trust in CI and slow teams down. 40% of teams report flaky test problems. These tests pass sometimes and fail other times with the same code.

Common Error Patterns:

Pattern: “Test timeout exceeded”

Example:

FAIL src/api.test.js (45.234s)
  ✕ should return user data (30001ms)

Error: Timeout - Async callback was not invoked within the 30000ms timeout

What it means: A test took longer than the configured timeout. Usually waiting for something that never happens (network call, element to appear, promise to resolve).

How to diagnose:

  • Identify which specific test/assertion times out
  • Check if test makes network calls without mocks
  • Look for infinite waits or missing completion signals
  • Check if timeout only happens in CI (slower than local)

How to fix:

Increase timeout selectively (not globally):

// Jest
test('slow operation', async () => {
  // ...
}, 60000); // 60 second timeout for this test only

// Cypress
cy.get('.slow-element', { timeout: 10000 })

Better: Mock external calls:

// Mock fetch/axios to avoid real network calls
jest.mock('axios');
axios.get.mockResolvedValue({ data: mockData });

Best: Use proper async patterns:

// Bad: arbitrary wait
await new Promise(resolve => setTimeout(resolve, 5000));

// Good: wait for condition
await waitFor(() => expect(element).toBeVisible());

Pattern: “Intermittent assertion failures”

Example:

Expected: "Processing complete"
Received: "Processing..."

This test passes sometimes but fails randomly.

What it means: Race condition or timing dependency. Test doesn’t wait long enough for async operations, or depends on timing that varies.

How to diagnose:

  • Run test 10 times locally: npm test -- --testNamePattern="flaky test" --maxWorkers=1
  • Check for async operations without proper awaits
  • Look for tests that depend on execution order
  • Check if test shares state with other tests

How to fix:

Add proper waits:

// Bad: hope it's done by now
await someAsyncFunction();
expect(result).toBe('done');

// Good: wait for specific condition
await waitFor(() => {
  expect(getResult()).toBe('done');
}, { timeout: 5000 });

Isolate test data:

// Bad: shared state
const user = { id: 1, name: 'Test' };

// Good: unique per test
const user = { id: Date.now(), name: `Test-${uuid()}` };

Pattern: “Test passes locally but fails in CI”

Example:

Expected date: 2024-02-08T10:30:00Z
Received date: 2024-02-08T18:30:00Z

Passes on developer's machine but fails in CI

What it means: Environment differences causing failures. Common culprits: timezone differences, parallel execution, resource constraints.

How to diagnose:

  • Check timezone: echo $TZ vs CI environment
  • Compare resource limits (memory, CPU)
  • Check if failure only happens with parallel test execution
  • Look for hardcoded values that depend on environment

How to fix:

Set explicit timezone:

# GitHub Actions
env:
  TZ: UTC

# CircleCI
environment:
  TZ: "/usr/share/zoneinfo/UTC"

Disable test parallelization temporarily:

# Jest
npm test -- --maxWorkers=1

# pytest (with the pytest-xdist plugin)
pytest -n 0

Use relative dates:

// Bad: hardcoded date
const expectedDate = new Date('2024-02-08');

// Good: relative to now
const expectedDate = new Date();
expectedDate.setHours(0, 0, 0, 0);

Pattern: “Random order failures”

Example:

✓ test A passes
✓ test B passes
✕ test C fails

When run in different order, test C passes

What it means: Tests have shared state or side effects. One test leaves data/state that affects another.

How to diagnose:

  • Run tests in isolation: each test should pass alone
  • Check for global variables or singletons
  • Look for database records not cleaned up
  • Check for filesystem changes persisting

How to fix:

Use proper setup/teardown:

// Jest
beforeEach(() => {
  // Reset database
  database.clear();
  // Reset mocks
  jest.clearAllMocks();
});

afterEach(() => {
  // Clean up
  cleanup();
});

Database transactions:

// Wrap each test in transaction and rollback
beforeEach(async () => {
  await db.beginTransaction();
});

afterEach(async () => {
  await db.rollback();
});

Randomize test order to catch issues:

# Jest
npm test -- --randomize

# pytest (requires the pytest-random-order plugin)
pytest --random-order

Environment Mismatches

“Works on my machine” is the classic developer problem. Environment parity eliminates 30-50% of these failures.

Common Error Patterns:

Pattern: “Command not found / executable not in PATH”

Example:

/bin/sh: node: command not found
/bin/sh: python: command not found
bash: docker: command not found

What it means: The tool or executable isn’t installed in the CI environment, or it’s installed but not in the PATH.

How to diagnose:

  • Check if tool is installed: which node in CI debug session
  • Verify PATH environment variable includes tool location
  • Compare CI environment with local: echo $PATH

How to fix:

GitHub Actions - use setup actions:

- uses: actions/setup-node@v4
  with:
    node-version: '20'

- uses: actions/setup-python@v5
  with:
    python-version: '3.11'

CircleCI - use docker image with tools:

docker:
  - image: cimg/node:20.10

Generic - install manually:

# Install in pipeline
apt-get update && apt-get install -y nodejs

Pattern: “Wrong version of tool/runtime”

Example:

Error: Requires Node.js >= 18.0.0
Current: v16.14.0

Works locally (Node 20) but fails in CI (Node 16)

What it means: CI is using a different version than your local environment.

How to diagnose:

  • Check version in CI: node --version, python --version, go version
  • Compare with local version
  • Check if version is specified in CI config

How to fix:

Specify exact version in CI config:

# GitHub Actions
- uses: actions/setup-node@v4
  with:
    node-version-file: '.nvmrc'  # Read from .nvmrc

# CircleCI
docker:
  - image: cimg/node:20.10.0  # Exact version

Create version file:

# .nvmrc for Node
echo "20.10.0" > .nvmrc

# .python-version for Python
echo "3.11.5" > .python-version

# .tool-versions for asdf
echo "nodejs 20.10.0" > .tool-versions

Pattern: “Missing environment variable”

Example:

Error: API_KEY is not defined
KeyError: 'DATABASE_URL'
panic: $AWS_REGION not set

What it means: Code expects an environment variable that isn’t set in CI, or there’s a typo in the variable name.

How to diagnose:

  • Check secret configuration in CI provider
  • Verify variable name matches exactly (case-sensitive)
  • Check if secret is configured for the branch/environment
  • Look for typos in variable reference

How to fix:

Set secret in CI provider:

  • GitHub Actions: Settings → Secrets and variables → Actions → New repository secret
  • CircleCI: Project Settings → Environment Variables
  • Jenkins: Manage Jenkins → Credentials
  • Travis CI: Settings → Environment Variables

Reference correctly in config:

# GitHub Actions
env:
  API_KEY: ${{ secrets.API_KEY }}

# CircleCI - variables added in Project Settings → Environment Variables
# are injected into every job automatically; no mapping needed in config.yml

Add defaults for non-secret values:

const apiKey = process.env.API_KEY || 'default-for-testing';

Pattern: “File/directory not found”

Example:

ENOENT: no such file or directory, open '/app/config.json'
FileNotFoundError: [Errno 2] No such file or directory: './data.csv'

What it means: Code references a file path that doesn’t exist in CI. Common causes: case sensitivity differences (macOS file systems are case-insensitive by default, Linux is case-sensitive), a different working directory, or the file wasn’t committed to git.

How to diagnose:

  • Check if file exists in repository: git ls-files | grep config.json
  • Verify file case matches exactly: Config.json vs config.json
  • Check current working directory: pwd in CI logs
  • Look for .gitignore excluding the file

How to fix:

Use correct case:

// Bad (might work on macOS, fail on Linux)
import config from './Config.json';

// Good (exact case)
import config from './config.json';

Set working directory explicitly:

# GitHub Actions
- run: npm test
  working-directory: ./app

# CircleCI
- run:
    command: npm test
    working_directory: ~/project/app

Use absolute or resolved paths:

// Bad: relative path may break
const dataPath = './data.csv';

// Good: resolve from known location
const path = require('path');
const dataPath = path.join(__dirname, 'data.csv');

Permission and Access Errors

Permission errors often stem from credential expiry, insufficient scopes, or misconfigured access controls. These are among the most frustrating because the error messages are often vague.

Common Error Patterns:

Pattern: “Permission denied / Access forbidden”

Example:

HTTP 403 Forbidden
fatal: could not read from remote repository
docker: denied: requested access to the resource is denied

What it means: You’re authenticated but don’t have permission to perform the action. Common causes: expired credentials, insufficient permissions/scopes, or wrong access level.

How to diagnose:

  • Check token expiry date (if JWT, decode and check exp field)
  • Verify the token has the required scopes/permissions (see the sketch after this list)
  • Confirm user/service account has necessary roles
  • Check if resource access is restricted by IP, branch, or environment
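
For GitHub classic tokens, one way to see the granted scopes is the X-OAuth-Scopes response header (a sketch; fine-grained tokens don’t report scopes this way):

# Dump response headers and look for the scopes granted to the token
curl -s -D - -o /dev/null -H "Authorization: token $GITHUB_TOKEN" https://api.github.com/user | grep -i x-oauth-scopes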

How to fix:

For GitHub:

# Ensure token has correct permissions
# Settings → Developer settings → Personal access tokens
# Required scopes: repo, workflow, write:packages

For Docker registries:

# Login with credentials
echo $DOCKER_TOKEN | docker login -u $DOCKER_USER --password-stdin

# For private registries, specify registry
echo $TOKEN | docker login registry.company.com -u $USER --password-stdin

For AWS:

# Verify IAM role/user has required permissions
# Check policy attached to credentials being used
aws sts get-caller-identity  # Verify who you're authenticated as

Update credential scopes and rotate:

# Generate new token with correct scopes
# Update secret in CI provider
# Test with new token

Pattern: “Authentication failed”

Example:

HTTP 401 Unauthorized
error: Authentication failed for 'https://github.com/user/repo.git'
Login failed: invalid credentials

What it means: Authentication credentials are wrong, missing, or incorrectly formatted.

How to diagnose:

  • Check if secret is set: [ -z "$TOKEN" ] && echo "Not set"
  • Verify secret name matches variable reference (case-sensitive)
  • Check token format (Bearer token, Basic auth, API key)
  • Look for whitespace or newlines in secret value

How to fix:

Verify secret format:

# For Bearer tokens
Authorization: Bearer <token>

# For Basic auth (base64 encoded username:password)
Authorization: Basic <base64-encoded-credentials>

# For API keys
X-API-Key: <api-key>

Update secret in CI:

# GitHub Actions - verify secret is set
- run: |
    if [ -z "${{ secrets.API_TOKEN }}" ]; then
      echo "API_TOKEN not set"
      exit 1
    fi

Test authentication manually:

# Test with curl to verify credentials work
curl -H "Authorization: Bearer $TOKEN" https://api.example.com/test

Pattern: “Rate limit exceeded”

Example:

HTTP 429 Too Many Requests
API rate limit exceeded for <IP address>
You have exceeded your pull rate limit (Docker Hub)

What it means: You’ve made too many requests to an API or service. Unauthenticated requests often have lower limits.

How to diagnose:

  • Check rate limit headers in the response: X-RateLimit-Remaining, X-RateLimit-Reset (see the sketch after this list)
  • Verify you’re authenticating (authenticated = higher limits)
  • Count how many requests your pipeline makes
  • Check if multiple builds are running concurrently
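
A sketch of checking those headers, using GitHub’s API as the example:

# Dump response headers and look at the remaining quota
curl -s -D - -o /dev/null https://api.github.com/rate_limit | grep -i ratelimit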

How to fix:

Authenticate to get higher limits:

# GitHub Actions - automatic token for higher limits
- uses: actions/checkout@v4
  with:
    token: ${{ secrets.GITHUB_TOKEN }}

# Docker Hub - authenticate for higher limits
- run: echo ${{ secrets.DOCKER_PASSWORD }} | docker login -u ${{ secrets.DOCKER_USERNAME }} --password-stdin

Implement caching:

# Cache dependencies to reduce requests
- uses: actions/cache@v3
  with:
    path: ~/.npm
    key: ${{ runner.os }}-node-${{ hashFiles('**/package-lock.json') }}

Add retry logic with backoff:

# Retry on rate limit with exponential backoff
for i in {1..5}; do
  if curl -f https://api.example.com/resource; then
    break
  fi
  echo "Rate limited, waiting..."
  sleep $((2**i))
done

Use mirrors or proxies:

# npm registry mirror
npm config set registry https://registry.npmmirror.com

# PyPI mirror
pip install --index-url https://pypi.tuna.tsinghua.edu.cn/simple

Pattern: “EACCES: permission denied (file system)”

Example:

EACCES: permission denied, open '/usr/local/bin/tool'
PermissionError: [Errno 13] Permission denied: '/root/.config'
cannot create directory: Permission denied

What it means: File system permission issue. Can’t write to directory, execute file, or access resource due to Unix permissions.

How to diagnose:

  • Check file permissions: ls -la /path/to/file
  • Verify which user is running: whoami, id
  • Check directory ownership: ls -ld /path/to/directory
  • Look for writing to protected directories (/usr, /root)

How to fix:

Fix permissions:

# Make file executable
chmod +x ./script.sh

# Change ownership
sudo chown $USER:$USER /path/to/directory

# Create directory with proper permissions
mkdir -p ~/.config
chmod 755 ~/.config

Use user-writable locations:

# Bad: writing to system directory
npm install -g package

# Good: install into a user-writable prefix instead
npm config set prefix ~/.local
npm install -g package

Run as correct user:

# GitHub Actions - runs as runner user by default (good)

# Docker - run as non-root user
docker run --user $(id -u):$(id -g) image:tag

Use sudo when necessary (CI environments usually allow):

# Install system dependency
sudo apt-get update && sudo apt-get install -y build-essential

Resource and Timeout Problems

Resource constraints in CI environments can cause failures that never happen locally where you have more memory, disk space, and time.

Common Error Patterns:

Pattern: “Out of memory / OOM killed”

Example:

FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
Killed (process received SIGKILL - likely OOM)
The job exceeded the maximum memory limit

What it means: Process used more memory than available and was killed by the system. Common in: large builds, memory leaks, processing large datasets.

How to diagnose:

  • Check memory limits in CI provider
  • Look for memory usage spikes in CI logs
  • Profile memory usage locally (see the sketch after this list)
  • Check for obvious leaks (holding references, not cleaning up)

How to fix:

Increase memory allocation:

# Node.js - increase heap size
NODE_OPTIONS=--max-old-space-size=4096 npm run build

# Java - increase heap
JAVA_OPTS="-Xmx4g" ./gradlew build

Process data in chunks:

// Bad: load entire file into memory
const data = fs.readFileSync('huge-file.json', 'utf8');
const parsed = JSON.parse(data);

// Good: stream and process in chunks
const stream = fs.createReadStream('huge-file.json');
stream.on('data', chunk => processChunk(chunk));

Fix memory leaks:

// Bad: accumulating references
const cache = [];
function process(item) {
  cache.push(item); // Never cleared!
}

// Good: bounded cache
const cache = new Map();
function process(item) {
  if (cache.size > 1000) {
    cache.clear();
  }
  cache.set(item.id, item);
}

Request more resources:

# CircleCI - use larger resource class
resource_class: large  # 8GB memory (the default medium class has 4GB)

# GitHub Actions - use larger runner
runs-on: ubuntu-latest-4-cores  # More memory

Pattern: “Disk space full”

Example:

ENOSPC: no space left on device
Error: No space left on device (os error 28)
docker: write /var/lib/docker: no space left on device

What it means: The CI runner’s disk is full. Common causes: large dependencies, build artifacts, Docker layers, old cache.

How to diagnose:

  • Check disk usage in CI: df -h
  • Find large files: du -h --max-depth=1 | sort -hr
  • Check Docker disk usage: docker system df
  • Look for accumulating artifacts

How to fix:

Clean up before build:

# Remove old artifacts
rm -rf dist/ build/ *.log

# Clean Docker
docker system prune -af --volumes

# Clean package manager caches
npm cache clean --force
pip cache purge

Use smaller base images:

# Bad: large base image
FROM node:20

# Good: slim variant
FROM node:20-slim

# Better: alpine
FROM node:20-alpine

Optimize Docker layers:

# Bad: each RUN creates a layer
RUN apt-get update
RUN apt-get install -y package1
RUN apt-get install -y package2

# Good: combine and clean up
RUN apt-get update && \
    apt-get install -y package1 package2 && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

Configure artifact retention:

# GitHub Actions - delete artifacts after 7 days
- uses: actions/upload-artifact@v3
  with:
    name: build
    path: dist/
    retention-days: 7

Pattern: “Build/job timeout exceeded”

Example:

The job running on runner <id> has exceeded the maximum execution time of 60 minutes
Error: Job timeout reached (2 hours)

What it means: Pipeline exceeded the maximum allowed time. Usually slow tests, inefficient builds, or hanging processes.

How to diagnose:

  • Check which step takes longest in CI logs
  • Look for hanging processes (waiting on input, deadlock)
  • Profile test execution: npm test -- --verbose --coverage
  • Check for network timeouts or retries

How to fix:

Optimize slow steps:

# Parallelize tests
- run: npm test -- --maxWorkers=4

# Cache dependencies
- uses: actions/cache@v3
  with:
    path: ~/.npm
    key: ${{ runner.os }}-${{ hashFiles('**/package-lock.json') }}

Set timeouts appropriately:

# GitHub Actions - set job timeout
jobs:
  build:
    timeout-minutes: 30  # Fail fast instead of waiting default 6 hours

# Set step timeout
- run: npm test
  timeout-minutes: 10

Identify and fix hanging processes:

# Add timeout to commands
timeout 300 npm test  # Kill after 5 minutes

# Use --bail to stop on first failure
npm test -- --bail

Split into multiple jobs:

# Bad: one huge job
- run: |
    npm run build
    npm test
    npm run lint
    npm run deploy

# Good: parallel jobs
jobs:
  build: ...
  test: ...
  lint: ...
  deploy:
    needs: [build, test, lint]

Pattern: “Too many open files”

Example:

EMFILE: too many open files
Error: open /path/to/file: too many open files in system
ulimit: open files: cannot modify limit: Operation not permitted

What it means: Process exceeded the limit of open file descriptors. Common in: file watchers, parallel operations, file descriptor leaks.

How to diagnose:

  • Check open files: lsof -p $PID | wc -l
  • Check limit: ulimit -n
  • Look for file watchers (Jest, Webpack, test frameworks)
  • Search for file operations without closing

How to fix:

Close files properly:

// Bad: file descriptors leak
function readFile(path) {
  const fd = fs.openSync(path, 'r');
  const data = fs.readFileSync(fd);
  return data;  // fd never closed!
}

// Good: close file descriptor
function readFile(path) {
  const fd = fs.openSync(path, 'r');
  try {
    return fs.readFileSync(fd);
  } finally {
    fs.closeSync(fd);
  }
}

// Better: use higher-level APIs that handle closing
function readFile(path) {
  return fs.readFileSync(path, 'utf8');
}

Increase ulimit (if allowed):

# GitHub Actions
- run: |
    ulimit -n 4096
    npm test

# CircleCI - raise the limit in the step's shell
- run:
    command: |
      ulimit -n 4096
      npm test

Reduce file watchers:

// Jest - use --maxWorkers to limit watchers
"test": "jest --maxWorkers=2"

// Webpack - reduce watched files
watchOptions: {
  ignored: /node_modules/,
}

Provider-Specific Debugging Tools

Each CI provider has built-in debugging capabilities to help investigate failures. Here are the essential tools you should know:

CircleCI

Debugging Tools for CircleCI

CircleCI provides powerful debugging features including SSH access and insights dashboards to help diagnose failures.

1. SSH Rerun for Interactive Debugging

When to use: Need to inspect the CI environment interactively, investigate why a step fails, or debug environment-specific configuration issues.

How to enable:

  1. Go to your failed job in CircleCI dashboard
  2. Click “Rerun” dropdown in top right
  3. Select “Rerun job with SSH”
  4. Wait for job to start
  5. Copy SSH command from job output:
    SSH enabled. To connect:
    ssh -p PORT USERNAME@HOST

What you get:

  • Full SSH access to the running container
  • All environment variables and configuration
  • Ability to run failed commands manually
  • Inspect files and directory structure

Usage tips:

# Navigate to working directory
cd ~/project

# Check environment
printenv | grep -i api
env | sort

# Re-run failed command with modifications
npm test -- --verbose

# Check tool versions
node --version
npm --version

# Inspect files
ls -la
cat .circleci/config.yml

Important notes:

  • SSH session times out after 10 minutes of inactivity (configurable up to 2 hours)
  • Connection stays open for duration even if steps complete
  • Type exit or close terminal when done to free resources

2. Insights Dashboard

When to use: Identify patterns in failures, find flaky tests, track pipeline performance over time, or spot bottlenecks.

How to access:

  1. Go to CircleCI dashboard
  2. Click “Insights” in left sidebar
  3. Select your project

What you see:

  • Success rate over time: Spot trends in pipeline reliability
  • Duration trends: See if builds are getting slower
  • Flaky test detection: Tests that pass/fail intermittently
  • Most failed tests: Focus debugging efforts on problematic tests
  • Credit usage: Track resource consumption

Key metrics to monitor:

  • Pipeline success rate (target: >95%)
  • Mean time to recovery (MTTR)
  • Test flakiness percentage
  • Job duration trends (catch performance regressions)

3. Step Output and Timing

When to use: Identify which specific step is failing or taking too long, understand the execution flow, or optimize slow pipelines.

How to access: Built into every job view - each step is expandable with timing information.

What you get:

  • Timing for each step (helps identify bottlenecks)
  • Full stdout/stderr output
  • Color-coded output (errors in red)
  • Collapsible sections for readability

Debugging tips:

# Add timing for custom commands
- run:
    name: "Run tests with timing"
    command: |
      echo "Starting tests at $(date)"
      time npm test
      echo "Finished tests at $(date)"

# Add debug output
- run:
    name: "Debug environment"
    command: |
      echo "=== Environment Variables ==="
      env | sort
      echo "=== Installed tools ==="
      node --version
      npm --version
      echo "=== Disk space ==="
      df -h

Store and access logs as artifacts:

- store_artifacts:
    path: test-results
    destination: test-results

- store_artifacts:
    path: /tmp/logs
    destination: logs

Access artifacts via:

  • Job page → Artifacts tab
  • Or direct URL: https://output.circle-artifacts.com/output/job/:job-id/artifacts/:container-index/:path

GitHub Actions

Debugging Tools for GitHub Actions

GitHub Actions provides several built-in debugging capabilities to help investigate failures quickly.

1. Enable Debug Logging

When to use: You need more verbose output to understand what’s happening between steps, or want to see hidden commands and environment setup.

How to enable:

Add these secrets to your repository (Settings → Secrets and variables → Actions):

  • ACTIONS_STEP_DEBUG = true - Shows detailed step-by-step execution logs
  • ACTIONS_RUNNER_DEBUG = true - Shows runner diagnostic information
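
You can also set them from the command line (assumes the GitHub CLI is installed and authenticated for the repository):

gh secret set ACTIONS_STEP_DEBUG --body "true"
gh secret set ACTIONS_RUNNER_DEBUG --body "true"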

What you get:

  • Detailed logs showing environment variable setup
  • Hidden commands executed by actions
  • Step-by-step execution trace
  • Runner diagnostic information

Example output difference:

Without debug logging:

Run actions/checkout@v4

With debug logging:

##[debug]Evaluating condition for step: 'Checkout code'
##[debug]Evaluating: success()
##[debug]Evaluating success:
##[debug]=> true
##[debug]Result: true
##[debug]Starting: Checkout code
Run actions/checkout@v4
##[debug]Getting Git version info
##[debug]Working directory is '/home/runner/work/repo'
##[debug]Running command: git --version

2. Interactive SSH Debugging with tmate

When to use: You need to inspect the CI environment interactively, debug environment-specific issues, or investigate complex failures hands-on.

How to use:

Add the tmate action to your workflow (triggers on failure):

- name: Setup tmate session
  uses: mxschmitt/action-tmate@v3
  if: failure()
  timeout-minutes: 30

Or add it temporarily to debug a specific step:

- name: Debug with tmate
  uses: mxschmitt/action-tmate@v3
  if: always()  # Run even if previous steps succeed

What you get:

  • SSH access to the runner environment
  • Connection string displayed in logs
  • Ability to run commands interactively
  • Inspect files, environment variables, installed tools

Usage tips:

  • Remove or comment out before merging to production
  • Use timeout-minutes to prevent hanging
  • Set repository secret ACTIONS_STEP_DEBUG to see connection info sooner

3. Download and Inspect Artifacts

When to use: You need to examine build outputs, test reports, logs, or coverage data after the run completes.

How to access:

  1. Go to Actions tab in your repository
  2. Click on the workflow run
  3. Scroll to bottom to see Artifacts section
  4. Click to download zip file

Upload debugging artifacts:

- name: Upload test results
  uses: actions/upload-artifact@v4
  if: always()  # Upload even if tests fail
  with:
    name: test-results
    path: |
      test-results/
      coverage/
      **/*.log
    retention-days: 7

Common artifacts to upload for debugging:

  • Test results and reports (test-results/, junit.xml)
  • Code coverage reports (coverage/, htmlcov/)
  • Build logs (*.log, build.log)
  • Screenshots from E2E tests (screenshots/, cypress/screenshots/)
  • Application bundles or builds (dist/, build/)

Tips:

  • Use if: always() to upload even when steps fail
  • Set reasonable retention-days to save storage costs
  • Use descriptive artifact names for multiple uploads
  • Combine related files in single artifact to reduce clutter

Jenkins

Debugging Tools for Jenkins

Jenkins provides comprehensive debugging capabilities through console logs, pipeline replay, and workspace inspection.

1. Console Output

When to use: First stop for any Jenkins build failure. Provides complete build log with timestamps and color-coded output.

How to access:

  1. Navigate to your build (job → build number)
  2. Click “Console Output” in left sidebar
  3. Or append /console to build URL

What you get:

  • Complete stdout/stderr from entire build
  • Timestamps for each line (if configured)
  • ANSI color codes for readability
  • Full stack traces for errors
  • Environment variable output

Enable timestamps:

// Jenkinsfile
pipeline {
  options {
    timestamps()
  }
}

Debugging tips:

// Add debug output in pipeline
stage('Debug') {
  steps {
    script {
      echo "=== Environment Variables ==="
      sh 'env | sort'

      echo "=== Tool Versions ==="
      sh '''
        node --version
        npm --version
        docker --version
      '''

      echo "=== Disk Space ==="
      sh 'df -h'

      echo "=== Current Directory ==="
      sh 'pwd && ls -la'
    }
  }
}

Search console output:

  • Use browser’s Find (Ctrl/Cmd+F)
  • Download console log and search locally
  • Use Jenkins “Console Output” search feature if available

2. Pipeline Replay

When to use: Need to test pipeline changes without committing to repository, quickly iterate on fixes, or modify pipeline script to add debug output.

How to use:

  1. Go to a completed pipeline build
  2. Click “Replay” in left sidebar
  3. Modify the pipeline script directly in browser
  4. Click “Run” to execute modified version

What you get:

  • Ability to modify Jenkinsfile without committing
  • Test fixes quickly before pushing to git
  • Add debug statements temporarily
  • Try different configurations

Common debugging modifications:

// Original failing step
stage('Test') {
  steps {
    sh 'npm test'
  }
}

// Modified for debugging via Replay
stage('Test') {
  steps {
    // Add environment inspection
    sh '''
      echo "NODE_VERSION: $(node --version)"
      echo "NPM_VERSION: $(npm --version)"
      echo "PATH: $PATH"
    '''

    // Run with more verbose output
    sh 'npm test -- --verbose'

    // Or run subset of tests
    sh 'npm test -- --testPathPattern=failing-test.js'
  }
}

Important notes:

  • Replay uses the same commit, workspace, and parameters
  • Changes are NOT saved - commit to Jenkinsfile when working
  • Can replay multiple times with different modifications

3. Workspace Inspection

When to use: Need to examine build artifacts, inspect generated files, check directory structure, or understand what files were created during build.

How to access:

  1. Navigate to build (job → build number)
  2. Click “Workspace” in left sidebar
  3. Browse directory structure

Or configure the pipeline to keep build state around for later inspection:

// Add to Jenkinsfile to preserve workspace
options {
  skipDefaultCheckout(false)
  preserveStashes(buildCount: 5)
}

What you get:

  • Browse all files in workspace
  • Download individual files
  • View file contents directly in browser
  • Inspect build artifacts and logs

Access workspace via SSH/command line:

// Print workspace location for SSH access
stage('Debug') {
  steps {
    echo "Workspace: ${env.WORKSPACE}"
    sh 'echo "Access workspace at: $(hostname):$PWD"'
  }
}

Archive artifacts for later inspection:

post {
  always {
    archiveArtifacts artifacts: '**/target/*.jar, **/build/libs/*.jar, **/*.log, test-results/**/*.xml',
                     allowEmptyArchive: true
  }
}

Tips for workspace debugging:

  • Check directory structure: sh 'find . -type f | head -20'
  • Verify files were created: sh 'ls -la dist/ || echo "dist/ not found"'
  • Check file permissions: sh 'ls -la'
  • Search for files: sh 'find . -name "*.log"'

Travis CI

Debugging Tools for Travis CI

Travis CI provides debug mode, structured build logs, and build matrix views to help diagnose failures across different configurations.

1. Debug Mode

When to use: Need interactive access to the build environment, want to run commands manually, or need to inspect the environment that caused a failure.

How to enable:

Via Travis CI API:

# Trigger debug build via API
curl -s -X POST \
  -H "Content-Type: application/json" \
  -H "Accept: application/json" \
  -H "Travis-API-Version: 3" \
  -H "Authorization: token YOUR_TRAVIS_TOKEN" \
  -d '{"quiet": true}' \
  https://api.travis-ci.com/job/{job_id}/debug

Or restart a job with debug enabled:

  1. Go to your build on travis-ci.com
  2. Click “Restart build” → “Debug build”
  3. Build will start with SSH access enabled

What you get:

  • SSH connection string in build log
  • Access to build environment for 30 minutes
  • Ability to run commands interactively
  • Full access to environment variables and tools

Connect via SSH:

# Connection details appear in build log
ssh [email protected]

# Once connected, navigate to repo
cd ~/build/username/repo

# Run failed commands with modifications
bundle exec rake test --verbose

Important notes:

  • Debug mode is enabled by default for private repositories; public repositories need it enabled by Travis CI support
  • Session expires after 30 minutes
  • Build doesn’t automatically proceed - you control execution

2. Build Logs with Fold Sections

When to use: First place to look for any Travis build failure. Logs are structured with collapsible sections for easy navigation.

How to access:

  1. Click on your build in Travis CI dashboard
  2. Logs appear automatically in main view
  3. Click sections to expand/collapse

What you see:

  • Worker info: VM image, environment details
  • System info: OS, kernel version, tools
  • Before install: Dependency setup
  • Install: Package installation
  • Before script: Pre-test setup
  • Script: Main build/test commands
  • After success/failure: Post-build steps

Add custom fold sections:

# .travis.yml
script:
  - echo -e "travis_fold:start:tests"
  - echo "Running tests..."
  - npm test
  - echo -e "travis_fold:end:tests"

  - echo -e "travis_fold:start:linting"
  - echo "Running linter..."
  - npm run lint
  - echo -e "travis_fold:end:linting"

Debugging tips:

# Add verbose output
script:
  - echo "=== Environment Check ==="
  - node --version && npm --version
  - echo "=== Disk Space ==="
  - df -h
  - echo "=== Running Tests ==="
  - npm test -- --verbose

3. Build Matrix View

When to use: Testing across multiple configurations (Node versions, OS, etc.) and need to identify which specific configuration is failing.

How to access: Build page automatically shows matrix if configured in .travis.yml

What you get:

  • Visual grid of all build combinations
  • Quick identification of failing configurations
  • Compare successful vs failed builds
  • Isolate environment-specific issues

Example matrix configuration:

language: node_js
node_js:
  - '16'
  - '18'
  - '20'
os:
  - linux
  - osx
  - windows

This creates 9 builds:

  • Node 16 + Linux
  • Node 16 + macOS
  • Node 16 + Windows
  • Node 18 + Linux
  • Node 18 + macOS
  • … etc

Debugging matrix failures:

If only specific combinations fail:

# Exclude known problematic combinations
jobs:
  exclude:
    - os: windows
      node_js: '16'  # Known issue with Windows + Node 16

  # Or allow specific failures
  allow_failures:
    - os: windows  # Windows builds can fail without blocking

Access matrix-specific environment:

script:
  - echo "Testing on $TRAVIS_OS_NAME with Node $(node --version)"
  - npm test

Tips for matrix debugging:

  • Start with minimal matrix (1-2 combinations) to isolate
  • Add configurations incrementally
  • Use allow_failures for unstable environments
  • Compare logs between passing and failing matrix builds