I recently needed to process a large text file (25GB). The file was basically a "csv" file, containing many rows of delimited data fields. For various reasons it was easier for me to process the file in smaller chunks of roughly 5GB each. I was about to write a C# or Python script to split up the file, but then I decided to search for an existing solution. Here is what I found.
Most Linux distributions contain two handy utilities: wc, which counts the lines, words, and bytes in a file, and split, which breaks a file into smaller pieces.
However, these commands don't work directly on Windows, so you need a Linux or Linux-like environment to run them.
Using Ubuntu on Windows Subsystem for Linux (WSL) is certainly an option, but this seemed like a bit more overhead than I was in the mood for, as it required me to run Ubuntu and remember the commands to change to my local Windows directory (note: cd / then cd /mnt/c/data), etc. In hindsight not a big deal, but I wanted something quick!
I've been using Git quite a bit lately for a couple of projects, so when I searched for a solution to the above problem, a Stack Overflow link indicated that the commands could just be run in "Git Bash", which was new to me.
Atlassian defines Git Bash as the following: "Git Bash is an application for Microsoft Windows environments which provides an emulation layer for a Git command line experience. Bash is an acronym for Bourne Again Shell. A shell is a terminal application used to interface with an operating system through written commands. Bash is a popular default shell on Linux and macOS. Git Bash is a package that installs Bash, some common bash utilities, and Git on a Windows operating system."
So while this isn't a full Linux environment like running Ubuntu in WSL, it's a way to run some of the common Linux utilities on Windows.
With Git installed, just enter "Git Bash" in the Windows 10/11 search bar and select the Git Bash App.
An application that looks like the command prompt utility will open, providing access to the Bash shell and utilities.
I downloaded a quick text file for the purposes of this article: sample-text-file.txt
I first ran this command to see how many lines were in the file:
wc -l < sample-text-file.txt
wc reports 3 lines for the sample. My 25GB file had 83 million lines. Note that wc -l counts line feed characters, so the last line of the sample file, which doesn't end in a line feed, is not included in the count.
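To see the line feed behavior for yourself, here is a minimal sketch that builds a small stand-in file. The file name demo.txt and its contents are made up for illustration; it is not the sample file from this article. It assumes the usual Git Bash utilities (mktemp, printf, wc, grep) are available.

```shell
# Work in a throwaway directory so nothing in the current folder is touched.
demo_dir=$(mktemp -d)
cd "$demo_dir"

# Hypothetical stand-in file: four lines, with no line feed after the last one.
printf 'line one\nline two\nline three\nline four' > demo.txt

# wc -l counts line feed characters, so the unterminated last line is missed.
newline_count=$(wc -l < demo.txt)

# grep -c '' counts every line, including an unterminated final one.
line_count=$(grep -c '' demo.txt)

echo "wc -l: $newline_count, grep -c: $line_count"
```

Here wc reports 3 and grep reports 4, which is the same off-by-one seen with the sample file. For an 83-million-line file the difference hardly matters, but it explains why split can produce one more chunk than the wc count suggests.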
Let's split the sample file into 1 line chunks using the following command:
split sample-text-file.txt -l 1
The result is 4 new text files in the same directory as the sample file, named "xaa", "xab", "xac", and "xad", each containing 1 line from the sample file. I split my 25GB file into 15-million-line chunks, which resulted in 6 files, each approximately 4.5GB in size.
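If the default "xaa"-style names are not to your liking, split can also take an output prefix and a few naming options. The following is a rough sketch rather than the exact command I ran: demo.txt, the chunk_ prefix, and the .csv suffix are hypothetical, and the -d and --additional-suffix options assume the GNU version of split, which is what Git Bash ships as far as I can tell.

```shell
# Work in a throwaway directory with a small, made-up delimited file.
demo_dir=$(mktemp -d)
cd "$demo_dir"
printf 'a,1\nb,2\nc,3\nd,4' > demo.txt

# -l 2: two lines per chunk; -d: numeric suffixes (00, 01, ...);
# --additional-suffix: keep a file extension; chunk_ is the output prefix.
split -l 2 -d --additional-suffix=.csv demo.txt chunk_

# Lists chunk_00.csv and chunk_01.csv.
ls chunk_*.csv
chunk_count=$(ls chunk_*.csv | wc -l)

# Concatenating the chunks in name order should reproduce the original file.
cat chunk_*.csv | cmp - demo.txt && echo "chunks match the original"

# Rough chunk count for the big file: ceiling of 83 million / 15 million.
total_lines=83000000
lines_per_chunk=15000000
expected_chunks=$(( (total_lines + lines_per_chunk - 1) / lines_per_chunk ))
echo "expected chunks: $expected_chunks"
```

The last calculation gives 6, which matches the number of files I ended up with. The cmp check is a cheap way to confirm that no lines were lost or reordered before deleting the original.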
Git Bash is a quick way to get access to the Bash shell and many common Linux utilities such as wc and split. This is especially handy if you don't have Windows Subsystem for Linux (WSL) installed.