
# Chapter 3: Working with Text

## The Power of Text Manipulation

Text processing is where the shell truly shines. In our increasingly data-driven world, the ability to extract, transform, and analyze text efficiently is invaluable. Whether you’re parsing log files, cleaning data sets, or automating document processing, the shell offers a remarkable toolkit for text manipulation.

Text in Unix systems is rather like water in the natural world - it flows between commands, can be filtered, redirected, and transformed. Understanding how to control this flow gives you tremendous power over your computing environment.

Let’s dive deeper into the tools that make text processing in the shell so powerful.


## Input and Output Redirection

Before we explore text processing commands, we need to understand how to control where text comes from and where it goes.

### Standard Streams

The shell uses three standard “streams” for input and output:

  • stdin (0): Standard input - where commands read their input
  • stdout (1): Standard output - where commands send their normal output
  • stderr (2): Standard error - where commands send error messages
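
To see the streams in action, ask ls for one file that exists and one that doesn't (the exact error wording varies by system):

$ ls /etc/hostname missing_file
ls: cannot access 'missing_file': No such file or directory    # goes to stderr
/etc/hostname                                                  # goes to stdout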

### Redirecting Output

To save command output to a file instead of displaying it on the screen:

$ ls -l > file_list.txt     # Save output to file (overwrites existing file)
$ ls -l >> file_list.txt    # Append output to file (preserves existing content)

To redirect error messages:

$ find / -name "*.conf" 2> errors.txt    # Save only errors to file
$ find / -name "*.conf" 2> /dev/null     # Discard errors entirely

To redirect both standard output and errors to the same file:

$ ls -l non_existent_file > output.txt 2>&1    # Redirect both to output.txt
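
Note that the order of redirections matters: the shell processes them left to right. Writing 2>&1 before the > redirection duplicates stderr onto the terminal before stdout has been pointed at the file, which is rarely what you want:

$ ls -l non_existent_file 2>&1 > output.txt    # Errors still appear on the terminal!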

A modern, cleaner syntax for redirecting both streams (supported by bash and zsh, though not by plain POSIX sh):

$ ls -l non_existent_file &> output.txt    # Redirect both to output.txt

### Redirecting Input

To use a file as input to a command:

$ sort < unsorted_list.txt    # Use file content as input to sort
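
Input and output redirection can be combined in a single command:

$ sort < unsorted_list.txt > sorted_list.txt    # Read from one file, write sorted output to another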

Redirection allows you to create pipelines that transform your data step by step - rather like a factory production line where each station performs a specific operation on the materials passing through it.


## Pipes: Connecting Commands

The pipe operator (|) connects the output of one command to the input of another, allowing you to build powerful command chains.

$ ls -l | grep "Mar"    # List files and filter for those containing "Mar"

Pipes can be chained to create complex data processing workflows:

$ cat access.log | grep "ERROR" | sort | uniq -c | sort -nr

This pipeline:

  1. Reads the log file
  2. Filters for lines containing “ERROR”
  3. Sorts the matching lines
  4. Counts unique occurrences
  5. Sorts numerically in reverse order (most frequent first)

Thinking in pipelines is a fundamental skill for shell mastery. Each command does one thing well, and their power comes from combining them in creative ways.


## Text Processing Commands

Now let’s explore the essential text processing tools in your shell toolkit.

### The Swiss Army Knives: grep, sed, and awk

These three commands form the cornerstone of text processing in the shell. They each deserve their own book, but we’ll cover the essentials.

We introduced grep in the previous chapter, but let’s explore it further:

$ grep "pattern" file.txt                 # Find lines matching pattern
$ grep -i "pattern" file.txt              # Case-insensitive search
$ grep -v "pattern" file.txt              # Find lines NOT matching pattern
$ grep -n "pattern" file.txt              # Show line numbers with matches
$ grep -c "pattern" file.txt              # Count matching lines
$ grep -o "pattern" file.txt              # Show only the matched part
$ grep -E "pattern1|pattern2" file.txt    # Extended grep (supports OR)
$ grep -A 2 "pattern" file.txt            # Show 2 lines after match
$ grep -B 2 "pattern" file.txt            # Show 2 lines before match
$ grep -C 2 "pattern" file.txt            # Show 2 lines before and after match

The name grep comes from the ed editor command g/re/p (globally search for a regular expression and print matching lines) - a delightful piece of computing archaeology that reminds us how command names that seem arbitrary today often have perfectly logical historical origins.

#### sed: Stream Editor

sed is designed for transforming text with search-and-replace operations:

$ sed 's/old/new/' file.txt              # Replace first occurrence on each line
$ sed 's/old/new/g' file.txt             # Replace all occurrences
$ sed 's/old/new/gi' file.txt            # Replace all, case-insensitive (the i flag is a common extension, not POSIX)
$ sed '1,5s/old/new/g' file.txt          # Replace only in lines 1-5
$ sed '/pattern/s/old/new/g' file.txt    # Replace only in lines matching pattern
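
To see the difference between the first two forms, feed sed a sample line with echo:

$ echo "old old old" | sed 's/old/new/'     # new old old (first occurrence only)
$ echo "old old old" | sed 's/old/new/g'    # new new new (every occurrence)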

By default, sed prints every line, modified or not. To print only modified lines:

$ sed -n 's/old/new/p' file.txt    # Print only lines that were changed

To delete lines:

$ sed '5d' file.txt            # Delete line 5
$ sed '/pattern/d' file.txt    # Delete lines matching pattern
$ sed '1,5d' file.txt          # Delete lines 1-5

Remember that sed doesn’t modify the original file. To save changes:

$ sed 's/old/new/g' file.txt > new_file.txt    # Save to new file
$ sed -i 's/old/new/g' file.txt                # Modify in place (be careful!)

The -i option for in-place editing is like performing surgery without a backup plan. Consider using -i.bak instead, which creates a backup file with the .bak extension:

$ sed -i.bak 's/old/new/g' file.txt    # Creates file.txt.bak before modifying

#### awk: Text Processing Language

While grep finds patterns and sed performs replacements, awk is a full-featured text processing language for more complex transformations:

$ awk '{print $1}' file.txt                     # Print first column of each line
$ awk '{print $1, $3}' file.txt                 # Print first and third columns
$ awk '{print $NF}' file.txt                    # Print last column
$ awk '{print NF}' file.txt                     # Print number of columns in each line
$ awk '{sum += $1} END {print sum}' file.txt    # Sum values in first column
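
A quick check of the summing idiom, using printf to supply a few numbers:

$ printf '3\n5\n7\n' | awk '{sum += $1} END {print sum}'    # Outputs 15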

Pattern-action pairs make awk particularly powerful:

$ awk '/pattern/ {print $1}' file.txt       # Print first column of matching lines
$ awk '$3 > 100 {print $1, $3}' file.txt    # Print when third column exceeds 100
$ awk 'NR==10, NR==20 {print}' file.txt     # Print lines 10-20

awk can format output with printf:

$ awk '{printf "Name: %-10s Age: %d\n", $1, $2}' people.txt
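
Given a line containing a name and an age, this produces neatly aligned output:

$ echo "Alice 25" | awk '{printf "Name: %-10s Age: %d\n", $1, $2}'
Name: Alice      Age: 25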

Learning awk is rather like discovering that your adjustable spanner is actually a complete workshop in disguise. What seems like a simple command at first reveals itself to be a full programming language with variables, functions, and control structures.

### Sorting and Uniqueness

#### sort: Arranging Text

The sort command arranges lines of text:

$ sort file.txt                  # Sort alphabetically
$ sort -r file.txt               # Reverse sort
$ sort -n file.txt               # Numerical sort
$ sort -k 2 file.txt             # Sort by second column
$ sort -k 2,2 -k 1,1 file.txt    # Sort by second column, then first
$ sort -t, -k2,2n data.csv       # Sort CSV by second column numerically
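
The difference between alphabetical and numerical sorting catches many newcomers out; compare:

$ printf '10\n9\n2\n' | sort      # 10, 2, 9 (character by character, "1" sorts before "2")
$ printf '10\n9\n2\n' | sort -n   # 2, 9, 10 (by numeric value)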

#### uniq: Finding Unique Lines

The uniq command reports or filters out repeated lines:

$ uniq file.txt           # Remove adjacent duplicate lines
$ sort file.txt | uniq    # Remove all duplicates (uniq only compares adjacent lines, hence the sort)
$ uniq -c file.txt        # Count occurrences of each line
$ uniq -d file.txt        # Show only duplicated lines
$ uniq -u file.txt        # Show only lines that appear exactly once

A common pattern for finding the most frequent items:

$ sort file.txt | uniq -c | sort -nr    # Count and sort by frequency
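
With a small sample, you can see exactly what this pattern produces (the width of the count column may vary):

$ printf 'apple\nbanana\napple\n' | sort | uniq -c | sort -nr
      2 apple
      1 banana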

### Cutting and Pasting Text

#### cut: Extracting Columns

The cut command extracts sections from each line:

$ cut -c 1-5 file.txt        # Characters 1-5 from each line
$ cut -d, -f 1,3 data.csv    # Fields 1 and 3 from CSV file
$ cut -d: -f1 /etc/passwd    # Extract usernames from passwd file

#### paste: Merging Lines

The paste command merges lines from different files:

$ paste file1.txt file2.txt        # Combine lines side by side with tabs
$ paste -d, file1.txt file2.txt    # Combine using comma as delimiter
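
A quick sketch with two throwaway files:

$ printf 'Alice\nBob\n' > file1.txt
$ printf '25\n32\n' > file2.txt
$ paste -d, file1.txt file2.txt    # Alice,25 and Bob,32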

### Text Transformation

#### tr: Translating Characters

The tr command translates or deletes characters. Unlike most filters, it reads only from standard input, which is why these examples feed it through a pipe:

$ cat file.txt | tr 'a-z' 'A-Z'        # Convert to uppercase
$ cat file.txt | tr -d '\r'            # Remove carriage returns
$ cat file.txt | tr -s ' '             # Squeeze multiple spaces into one
$ cat file.txt | tr '[:punct:]' ' '    # Replace punctuation with spaces

#### rev: Reversing Lines

The rev command reverses characters in each line:

$ echo "hello" | rev    # Outputs "olleh"

While this might seem like a party trick, rev can be surprisingly useful for certain text processing tasks, such as extracting parts of filenames from the end.
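
For instance, to grab the final extension of a filename no matter how many dots it contains, reverse the line, cut the first field, and reverse it back:

$ echo "archive.tar.gz" | rev | cut -d. -f1 | rev    # Outputs "gz"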

#### fold: Wrapping Lines

The fold command wraps lines to a specified width:

$ fold -w 80 file.txt       # Wrap at 80 characters
$ fold -s -w 80 file.txt    # Wrap at 80 characters, breaking at word boundaries

## Text Editors in the Shell

Sometimes you need to edit text directly rather than processing it through pipes. Let’s explore the most common text editors available in the shell.

### nano: Beginner-Friendly Editor

Nano is a simple, user-friendly editor that displays commands at the bottom of the screen:

$ nano file.txt

Common commands (^ means Ctrl):

  • ^O: Save file
  • ^X: Exit
  • ^K: Cut line
  • ^U: Paste
  • ^W: Search
  • ^G: Get help

Nano is perfect for quick edits when you don’t want to deal with the learning curve of more powerful editors.

### vim: The Programmer’s Editor

Vim is a powerful, modal editor with a steeper learning curve but tremendous capabilities:

$ vim file.txt

Vim has different modes:

  • Normal mode: For navigation and commands (default)
  • Insert mode: For typing text (press i to enter)
  • Visual mode: For selecting text (press v to enter)
  • Command mode: For saving, quitting, etc. (press : to enter)

Vim has a reputation for being difficult to exit, to the point where “How to exit Vim” is a perennial favorite on programming forums. For the record, it’s :q! to quit without saving or :wq to save and quit - consider this your emergency exit information.

Basic vim commands (in normal mode):

  • i: Enter insert mode
  • Esc: Return to normal mode
  • :w: Save file
  • :q: Quit
  • :wq: Save and quit
  • :q!: Quit without saving
  • /pattern: Search for pattern
  • n: Next search result
  • dd: Delete line
  • yy: Copy line
  • p: Paste

### emacs: The Extensible Editor

Emacs is another powerful editor with its own ecosystem:

$ emacs file.txt

Common commands:

  • C-x C-s: Save file (Ctrl+x, Ctrl+s)
  • C-x C-c: Exit
  • C-k: Cut line
  • C-y: Paste
  • C-s: Search forward

The vi vs. emacs debate is one of computing’s oldest religious wars, with passionate adherents on both sides. Choose wisely, or risk being drawn into debates that have been ongoing since the 1970s.


## Practical Examples

Let’s cement our understanding with some practical examples of text processing in the shell.

### Example 1: Analyzing Log Files

Extract all ERROR messages from a log file, count their occurrences, and sort by frequency (cut -d: -f4 assumes the message is the fourth colon-separated field; adjust -f to match your log format):

$ grep "ERROR" application.log | cut -d: -f4 | sort | uniq -c | sort -nr

### Example 2: CSV Processing

Extract specific columns from a CSV file, sort by the second extracted column, and format them for display:

$ cut -d, -f1,3,5 data.csv | sort -t, -k2,2 | sed 's/,/ | /g'    # Sort while still comma-separated, then prettify

### Example 3: Finding Duplicate Files

List potential duplicate files based on size (a rough heuristic; this simple version assumes filenames without spaces):

$ find . -type f -exec ls -l {} \; | awk '{print $9, $5}' | sort -k2,2n | uniq -D -f1    # -D prints all duplicated lines (GNU uniq)
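
If your find supports the GNU -printf action (an extension, not POSIX), a sketch that avoids parsing ls output entirely; the awk program buffers the first file of each size group and prints a group only when a second file with the same size appears:

$ find . -type f -printf '%s %p\n' | sort -n | awk '
    $1 == prev { if (prevline != "") print prevline; print; prevline = "" }
    $1 != prev { prev = $1; prevline = $0 }'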

### Example 4: Word Count in a Document

Count words in a document and find the most common ones:

$ cat document.txt | tr -cs '[:alpha:]' '\n' | tr '[:upper:]' '[:lower:]' | sort | uniq -c | sort -nr | head -10

This pipeline:

  1. Extracts words by converting non-letters to newlines
  2. Converts everything to lowercase
  3. Sorts the words
  4. Counts occurrences
  5. Sorts by frequency (most common first)
  6. Shows the top 10

### Example 5: Converting File Formats

Convert a Windows text file to Unix format (removing carriage returns):

$ cat windows.txt | tr -d '\r' > unix.txt

If you’ve ever opened a Windows text file on a Unix system and seen those mysterious ^M characters at the end of each line, this command is your friend.
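
Before converting, you can confirm the diagnosis; on most systems cat -v makes the carriage returns visible as ^M:

$ cat -v windows.txt    # Each line ends in ^M if the file uses Windows line endings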


## Regular Expressions: The Secret Sauce

Many text processing tools use regular expressions (regex) for pattern matching. While a complete regex tutorial is beyond our scope, here are some fundamental patterns:

  • . - Matches any single character
  • ^ - Matches the start of a line
  • $ - Matches the end of a line
  • * - Matches zero or more of the preceding element
  • + - Matches one or more of the preceding element (extended regex; use grep -E, or \+ in GNU grep/sed)
  • ? - Matches zero or one of the preceding element (also extended regex)
  • [abc] - Matches any one of the characters in brackets
  • [^abc] - Matches any character NOT in brackets
  • \d - Matches a digit (Perl-style; needs grep -P, or use [0-9] / [[:digit:]] portably)
  • \w - Matches a word character (letters, digits, underscore; a GNU extension in grep and sed)
  • \s - Matches a whitespace character (GNU extension; [[:space:]] is the portable form)

Examples in action:

$ grep "^#" file.txt                   # Lines starting with #
$ grep "[0-9]\{3\}" file.txt           # Lines containing 3 consecutive digits
$ sed 's/[0-9]\+/NUMBER/g' file.txt    # Replace all numbers with "NUMBER"
$ grep -E '\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b' file.txt    # Find email addresses

Regular expressions are like a language unto themselves - cryptic at first glance, but remarkably expressive once you learn their vocabulary. They’re the difference between saying “find me something that looks vaguely like an email” and precisely defining what pattern constitutes a valid email address.


## Practical Exercises

Let’s practice what we’ve learned with some hands-on exercises:

  1. Create a text file named people.txt with the following content:

    Alice,25,London
    Bob,32,Manchester
    Charlie,45,Birmingham
    Diana,28,Glasgow
    Edward,39,Liverpool
  2. Extract only the names and ages:

    $ cut -d, -f1,2 people.txt
  3. Sort the file by age (numerically):

    $ sort -t, -k2,2n people.txt
  4. Replace all commas with tabs:

    $ cat people.txt | tr ',' '\t'
  5. Add line numbers to the file:

    $ cat -n people.txt
  6. Find all people from cities that contain the letter ‘m’ (this simple grep matches an ‘m’ anywhere on the line; it works here only because none of the names contain one):

    $ grep -i "m" people.txt

Challenge: Create a one-liner that extracts just the city names, sorts them alphabetically, and shows each city only once.

Solution: cut -d, -f3 people.txt | sort | uniq
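
Alternatively, sort -u folds the sort and uniq steps into one: cut -d, -f3 people.txt | sort -u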


## Conclusion

Text processing is one of the shell’s greatest strengths. The commands we’ve explored in this chapter provide a powerful toolkit for manipulating, analyzing, and transforming text data of all kinds. While there’s certainly more to learn, mastering these fundamental tools will enable you to handle a vast array of text processing tasks efficiently.

Remember that the real power comes from combining these tools using pipes and redirection. Each command does one thing well, and their strength emerges when you connect them in creative ways to solve complex problems.

In the next chapter, we’ll explore process management – how to control running programs, monitor system resources, and multitask effectively in the shell environment.

Practice these text manipulation techniques, and you’ll soon find yourself processing data with the elegant efficiency of a master chef preparing ingredients - each slice, dice, and transformation bringing you closer to the final desired result.