Saturday, 7 October 2017

Split command in Linux/Unix

The split command is very useful when you are managing large files. Suppose you have a CSV file with millions of records that takes too long to open. In this case you can split the file into smaller pieces and open each piece easily in any GUI.


By default each split file contains 1000 lines and the default PREFIX is "x". However, we can split a file by number of lines or by bytes, and can change the prefix as well. In this article I will show you how to use the split command with examples.

Let us consider we have a file testfile.csv with 1342 records.
[~]$ cat testfile.csv | wc -l
1342
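
To follow along without a real data set, a sample file of the same size can be generated. This is only a sketch: the scratch directory, the column layout (id,name,value) and the values are made up for illustration, and only the file name and the line count come from the example above.

```shell
# Work in a scratch directory so the split pieces do not clutter anything else.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Write 1342 made-up CSV records (id,name,value) to testfile.csv.
seq 1 1342 | awk '{ printf "%d,name%d,%d\n", $1, $1, $1 * 10 }' > testfile.csv

# Confirm the record count.
wc -l < testfile.csv
```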

1) Simple split example:

As you can see below, the split command splits testfile.csv into 2 pieces with the default prefix x. The file has 1342 records in total, so the first piece, xaa, gets the default 1000 lines and the second piece, xab, gets the remaining 342 records.
[~]$ split testfile.csv

[~]$ ls
testfile.csv  xaa  xab

[~]$ cat xaa | wc -l
1000

[~]$ cat xab | wc -l
342
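
A quick way to check that nothing was lost is to join the pieces again and compare the result with the original. The sketch below assumes stand-in data made with seq, and the name rejoined.csv is just an illustrative choice.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 1342 > testfile.csv      # stand-in data, one record per line

split testfile.csv             # creates xaa (1000 lines) and xab (342 lines)

# The shell expands x?? in sorted order, so cat rebuilds the original file.
cat x?? > rejoined.csv

# cmp -s prints nothing and returns 0 when the two files are identical.
if cmp -s testfile.csv rejoined.csv; then
    echo "pieces match the original"
fi
```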

2) Split file with specific number of lines:

We can use the -l option with the split command to put a specific number of lines into each split file. Say we want to split the file into pieces of 500 records each; then use the following command.
[~]$ split -l 500 testfile.csv

[~]$ ls
testfile.csv  xaa  xab  xac

[~]$ cat xaa | wc -l
500

[~]$ cat xab | wc -l
500

[~]$ cat xac | wc -l
342

3) Split file with a specific prefix:

If we want to use our own prefix "NEW" for the split files, pass it as the last argument, as in the following command.
[~]$ split -l 500 testfile.csv NEW

[~]$ ls
NEWaa  NEWab  NEWac  testfile.csv
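
The prefix is simply the beginning of the output file name, so it can also contain a directory. The sketch below keeps the pieces apart from the original file; the directory name parts is a hypothetical choice and the data is a stand-in made with seq.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 1342 > testfile.csv      # stand-in data, one record per line

# The directory must exist before split writes into it.
mkdir parts
split -l 500 testfile.csv parts/NEW

# Lists NEWaa, NEWab and NEWac.
ls parts
```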

4) Split file with numeric suffix:

We can use numeric suffixes like 00, 01, 02 ... instead of the default aa, ab, ac ... with the -d option, as follows.
[~]$ split -l 50 -d testfile.csv NEW

[~]$ ls
NEW00  NEW02  NEW04  NEW06  NEW08  NEW10  NEW12  NEW14  NEW16  NEW18  NEW20  NEW22  NEW24  NEW26
NEW01  NEW03  NEW05  NEW07  NEW09  NEW11  NEW13  NEW15  NEW17  NEW19  NEW21  NEW23  NEW25  testfile.csv

By default the numeric suffix has 2 digits, which allows only 100 split files (NEW00 to NEW99). If the split needs more files than that, a version of split that does not lengthen the suffix automatically stops with the following "suffixes exhausted" message, and the records after NEW99 are not written to any split file.
[~]$ split -l 10 -d testfile.csv NEW
split: output file suffixes exhausted

[~]$ ls
NEW00  NEW05  NEW10  NEW15  NEW20  NEW25  NEW30  NEW35  NEW40  NEW45  NEW50  NEW55  NEW60  NEW65  NEW70  NEW75  NEW80  NEW85  NEW90  NEW95  testfile.csv
NEW01  NEW06  NEW11  NEW16  NEW21  NEW26  NEW31  NEW36  NEW41  NEW46  NEW51  NEW56  NEW61  NEW66  NEW71  NEW76  NEW81  NEW86  NEW91  NEW96
NEW02  NEW07  NEW12  NEW17  NEW22  NEW27  NEW32  NEW37  NEW42  NEW47  NEW52  NEW57  NEW62  NEW67  NEW72  NEW77  NEW82  NEW87  NEW92  NEW97
NEW03  NEW08  NEW13  NEW18  NEW23  NEW28  NEW33  NEW38  NEW43  NEW48  NEW53  NEW58  NEW63  NEW68  NEW73  NEW78  NEW83  NEW88  NEW93  NEW98
NEW04  NEW09  NEW14  NEW19  NEW24  NEW29  NEW34  NEW39  NEW44  NEW49  NEW54  NEW59  NEW64  NEW69  NEW74  NEW79  NEW84  NEW89  NEW94  NEW99

To overcome this, you can increase the number of digits in the suffix with the -a option, as follows.
[~]$ split -l 10 -a 3 -d testfile.csv NEW

[~]$ ls
NEW000  NEW007  NEW014  NEW021 .........  NEW099  NEW100  NEW101  .........  NEW134
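
The suffix length can also be worked out before running split. The following is a rough sketch under the assumption that the file is split by lines; the variable names are hypothetical and the data is a stand-in made with seq.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 1342 > testfile.csv      # stand-in data, one record per line

lines_per_piece=10
total=$(wc -l < testfile.csv)

# Number of pieces, rounded up: (1342 + 10 - 1) / 10 = 135.
pieces=$(( (total + lines_per_piece - 1) / lines_per_piece ))

# The highest suffix is pieces - 1 (134 here); its digit count is the width needed.
last=$(( pieces - 1 ))
width=${#last}

echo "pieces=$pieces width=$width"
split -l "$lines_per_piece" -a "$width" -d testfile.csv NEW
```

With 1342 records and 10 lines per piece this gives 135 pieces and a width of 3, so the files run from NEW000 to NEW134.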

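5) Split file into pieces of a specific size:

We can use the -b option with the desired size of each piece. A plain number is a size in bytes, while the k suffix means multiples of 1024, so -b4000 gives 4000-byte pieces and -b4k gives 4096-byte pieces. The listing below is from the second command.
[~]$ split -b4000 testfile.csv
              (or, for 4096-byte pieces)
[~]$ split -b4k testfile.csv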

[~]$ ls -ltr x*
-rw-rw-r-- 1 mukesh mukesh 3888 Oct  7 21:14 xae
-rw-rw-r-- 1 mukesh mukesh 4096 Oct  7 21:14 xad
-rw-rw-r-- 1 mukesh mukesh 4096 Oct  7 21:14 xac
-rw-rw-r-- 1 mukesh mukesh 4096 Oct  7 21:14 xab
-rw-rw-r-- 1 mukesh mukesh 4096 Oct  7 21:14 xaa
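
Note that -b cuts at exact byte positions, so a record can end up divided between two pieces. If each piece should hold only whole lines, GNU split offers the -C option, which fills a piece with complete lines up to the given size. The sketch below assumes GNU split; the prefixes BYTES and LINES and the sample data are made up for illustration.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 1342 | awk '{ printf "%d,name%d,%d\n", $1, $1, $1 * 10 }' > testfile.csv

# -b cuts at exact byte offsets, so a record can be split across two pieces.
split -b 4k testfile.csv BYTES

# -C (GNU split) fills each piece with whole lines, up to 4096 bytes.
split -C 4k testfile.csv LINES

# Compare the last line of the first piece from each run.
tail -n 1 BYTESaa
echo
tail -n 1 LINESaa
```

In both cases, joining the pieces with cat gives back the original file; only the places where the cuts fall are different.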

6) Split file into 2 files of equal size:

We can use the -n option in place of -l to get a specific number of split files. Note that -n divides the file by size in bytes, not by number of records.
[~]$ split -n 2 -d testfile.csv NEW

[~]$ ls
NEW00  NEW01  testfile.csv

[~]$ cat NEW00 | wc -l
670

[~]$ cat NEW01 | wc -l
672

[~]$ cat testfile.csv | wc -l
1342

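In the above example you might expect 671 lines in each of NEW00 and NEW01, but -n 2 splits by bytes, not by lines: each piece gets half of the file's bytes, and the cut can fall in the middle of a record. Because records have different lengths, the line counts of the two halves differ. To get 671 lines in each piece, use -l 671 instead.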
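
For two pieces with the same number of records, one option is to compute the lines per piece and pass that to -l. The sketch below also shows -n l/2, which in GNU split divides by size like -n 2 but never cuts a line. The prefixes SIZE and HALF and the variable names are hypothetical, and the data is a stand-in made with seq.

```shell
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 1342 > testfile.csv      # stand-in data, lines of different lengths

# -n l/2 (GNU split): two pieces of similar size in bytes, whole lines only.
split -n l/2 -d testfile.csv SIZE

# For equal line counts, work out lines per piece (rounded up) and use -l.
total=$(wc -l < testfile.csv)
half=$(( (total + 1) / 2 ))
split -l "$half" -d testfile.csv HALF

# Show the line count of every piece.
wc -l SIZE0* HALF0*
```

The HALF pieces hold 671 lines each, while the line counts of the SIZE pieces depend on how long the records are.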

***End***
