Tabs or spaces for indentation? Statistics on 3.8 million Perl files created in 24 years

One of the eternal questions in programming is which characters to use for indentation in program code: tabs or spaces.

Sometimes there is no choice. In a Makefile, for example, you must use tabs. The Go programming language has an official utility, gofmt, that formats code, and this utility uses tabs for indentation. In the esoteric programming language Whitespace, tabs and spaces cannot replace each other. But many programming languages do not impose a choice and let the programmer decide which characters to use.

There is a fairly popular opinion about which characters to use for indentation. It goes like this: it does not matter which you choose, what matters most is consistency. If you use tabs, always use tabs. If you use spaces, use only spaces and never tabs.

There is even a comic on this subject:

(Two people completely disagree with each other about whether to use tabs or spaces, but absolutely agree that only one of them should be used.)

How are things in the real world? What is actually used?

It's easy enough to find out. You need to take the source code of some programs, count which characters are used, and look at the results.

This article is the result of a small study of how tabs and spaces are used in the world of the Perl programming language. There is a huge repository of Perl libraries called CPAN. I downloaded all the versions of all libraries that are currently on CPAN (about 135 thousand) and determined which characters are used for indentation.

Before you read any further, I suggest you take a minute to think and guess: which of these is the most popular for indentation?

  • tabs
  • spaces
  • or a mixture of tabs and spaces

Writing code

So, the task is clear. We need to take all the libraries from CPAN and check what is used for indentation.

First you need to download the whole CPAN. This is done with one command:

time /usr/bin/rsync -av --delete cpan-rsync.perl.org::CPAN /project/CPAN/

Three hours later, CPAN was downloaded. It takes up about 27 GB.

CPAN is a collection of files organized in a certain structure. Here's a snippet:

CPAN/authors/id
├── A
│   ├── AA
│   │   ├── AADLER
│   │   │   ├── CHECKSUMS
│   │   │   ├── Games-LogicPuzzle-0.10.readme
│   │   │   ├── Games-LogicPuzzle-0.10.tar.gz
│   │   │   ├── Games-LogicPuzzle-0.12.readme
│   │   │   ├── Games-LogicPuzzle-0.12.tar.gz

In this example, AADLER is the author's login, and Games-LogicPuzzle-0.10.tar.gz and Games-LogicPuzzle-0.12.tar.gz are releases.

CPAN currently has more than 7 thousand authors who have uploaded libraries. To avoid storing all 7 thousand folders in a single folder, a few more levels were added (the version control system git stores its data in a similar way).
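The directory layout described above can be sketched in a few lines of Python. This is only an illustration of the naming scheme visible in the tree listing (first letter, first two letters, full login). The function name author_dir is hypothetical and is not taken from the author's scripts.

```python
def author_dir(login):
    """Return the path of an author's folder relative to CPAN/authors/id.

    Assumption: the two extra levels are the first letter and the first
    two letters of the login, as in the A/AA/AADLER example.
    """
    return f"{login[0]}/{login[:2]}/{login}"


print(author_dir("AADLER"))  # A/AA/AADLER
```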

Libraries on CPAN can be packed with different archivers.

I started by counting the number of different file extensions in the CPAN/authors/id/ folder. Here is the script and the result of its work. The top archive extensions:

  • .tar.gz 135571
  • .tgz 903
  • .zip 652
  • .gz 612
  • .bz2 243
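A count like the one above could be produced roughly as follows. This is a sketch in Python, not the author's script. The rule that ".tar.gz" is treated as a single extension is an assumption made here so that such files are not counted under ".gz"; how the original script split names is not stated in the article.

```python
import os
from collections import Counter


def archive_extension(file_name):
    """Return the extension of a file name, in lower case.

    Assumption: ".tar.gz" is treated as one extension; everything else
    uses the part after the last dot (empty string if there is none).
    """
    lower = file_name.lower()
    if lower.endswith(".tar.gz"):
        return ".tar.gz"
    return os.path.splitext(lower)[1]


def count_extensions(file_names):
    """Count how many file names have each extension."""
    return Counter(archive_extension(name) for name in file_names)


names = [
    "CHECKSUMS",
    "Games-LogicPuzzle-0.10.readme",
    "Games-LogicPuzzle-0.10.tar.gz",
    "Games-LogicPuzzle-0.12.readme",
    "Games-LogicPuzzle-0.12.tar.gz",
]
for extension, count in count_extensions(names).most_common():
    print(extension or "(none)", count)
```

To run it over a real mirror, the file names could be collected with os.walk over CPAN/authors/id/.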

.tar.gz wins by such a margin that I decided it would be enough to count indentation characters only in the libraries packed as .tar.gz.

Then I wrote a few scripts. At first it was not entirely clear to me in what form I wanted the data about tabs and spaces, so I decided to build a system out of several components. The first step is to pre-process all 135 thousand release files and put the information about tabs and spaces into a database. I expected this to take a long time. The second step is to use the data from the database to quickly obtain results in different formats.

The result was the script fill_db. This script filled the database in a little more than five hours. But those five hours are the running time after everything had been debugged. The script did not work on the first try. The main problem was with Unicode. First there was a problem with the release μ-0.01.tar.gz by author APEIRON, then there were problems with the file t/words_with_ß.dat from the release Lingua-DE-ASCII-0.06 by author BIGJ. But in the end all the problems were solved and the script successfully made it through all the .tar.gz releases.

The script goes through all the .tar.gz files in CPAN. It unpacks each .tar.gz into a temporary folder, finds all the files in that folder with the extension .pm, .pl, .t or .pod, reads all the indentation, and checks whether the indentation contains spaces and/or tabs. The releases contain other files too, but I decided to limit myself to files that clearly relate to Perl.

The result of this script is 2 tables in the database. Here is an example of the data:
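The per-file check can be sketched like this. It is a Python illustration, not the author's fill_db script. It assumes that "indentation" means the run of spaces and tabs at the start of a line that also contains other characters, so whitespace-only lines are ignored; the article does not say how the original script treats such lines. The names is_perl_file and indentation_flags are hypothetical.

```python
import re

PERL_EXTENSIONS = (".pm", ".pl", ".t", ".pod")

# Leading run of spaces/tabs followed by a non-whitespace character.
# Assumption: lines that contain only whitespace do not count as indented.
INDENT = re.compile(r"^([ \t]+)\S", re.MULTILINE)


def is_perl_file(path):
    """True for the file types the article limits itself to."""
    return path.endswith(PERL_EXTENSIONS)


def indentation_flags(text):
    """Return (has_space_beginning, has_tab_beginning) as 0/1 flags."""
    has_space = has_tab = 0
    for match in INDENT.finditer(text):
        indent = match.group(1)
        if " " in indent:
            has_space = 1
        if "\t" in indent:
            has_tab = 1
    return has_space, has_tab


print(indentation_flags("if (1) {\n    print 1;\n}\n"))  # (1, 0)
print(indentation_flags("if (1) {\n\tprint 1;\n}\n"))    # (0, 1)
```

A file with no indented lines at all gives (0, 0).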

mysql> select * from releases limit 1;
+------------+--------+---------------------------------------------------------------+------------+
| release_id | author | file_name                                                     | timestamp  |
+------------+--------+---------------------------------------------------------------+------------+
|          1 | RUFF   | /cpan/authors/id/R/RU/RUFF/DJabberd-Authen-Dovecot-0.1.tar.gz | 1359325895 |
+------------+--------+---------------------------------------------------------------+------------+
1 row in set (0.00 sec)

mysql> select * from files where release_id = 1;
+---------+------------+--------------------------------------------------------+------+---------------------+-------------------+
| file_id | release_id | file_name                                              | size | has_space_beginning | has_tab_beginning |
+---------+------------+--------------------------------------------------------+------+---------------------+-------------------+
|       1 |          1 | DJabberd-Authen-Dovecot/lib/DJabberd/Authen/Dovecot.pm | 2047 |                   1 |                 1 |
|       2 |          1 | DJabberd-Authen-Dovecot/t/compiles.t                   |   64 |                   0 |                 0 |
+---------+------------+--------------------------------------------------------+------+---------------------+-------------------+
2 rows in set (0.02 sec)

mysql> select count(*) from releases;
+----------+
| count(*) |
+----------+
|   135343 |
+----------+
1 row in set (0.04 sec)

mysql> select count(*) from files;
+----------+
| count(*) |
+----------+
|  3828079 |
+----------+
1 row in set (5.71 sec)
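For readers who want to experiment with the same layout, the two tables can be approximated as follows. The author used MySQL; this sketch uses Python's built-in sqlite3 module as a stand-in, and the column types and constraints are assumptions inferred from the query output above, not the real schema.

```python
import sqlite3


def create_schema(conn):
    """Create tables shaped like the 'releases' and 'files' output above.

    Assumption: types and keys are guessed from the displayed columns.
    """
    conn.executescript(
        """
        CREATE TABLE releases (
            release_id INTEGER PRIMARY KEY,
            author     TEXT NOT NULL,
            file_name  TEXT NOT NULL,
            timestamp  INTEGER NOT NULL
        );
        CREATE TABLE files (
            file_id             INTEGER PRIMARY KEY,
            release_id          INTEGER NOT NULL REFERENCES releases(release_id),
            file_name           TEXT NOT NULL,
            size                INTEGER NOT NULL,
            has_space_beginning INTEGER NOT NULL,
            has_tab_beginning   INTEGER NOT NULL
        );
        """
    )


conn = sqlite3.connect(":memory:")
create_schema(conn)

# The example rows shown in the article.
conn.execute(
    "INSERT INTO releases VALUES (?, ?, ?, ?)",
    (1, "RUFF", "/cpan/authors/id/R/RU/RUFF/DJabberd-Authen-Dovecot-0.1.tar.gz", 1359325895),
)
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, 1, "DJabberd-Authen-Dovecot/lib/DJabberd/Authen/Dovecot.pm", 2047, 1, 1),
        (2, 1, "DJabberd-Authen-Dovecot/t/compiles.t", 64, 0, 0),
    ],
)
print(conn.execute("SELECT count(*) FROM files").fetchone()[0])  # 2
```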

Only spaces, only tabs, tabs and spaces, and...

In total, the database stores 2 flags for each file in a release:

  • whether spaces are used in indentation
  • whether tabs are used in indentation

Accordingly, the two flags give 4 combinations:

  • 11: both spaces and tabs are used
  • 10: only spaces are used
  • 01: only tabs are used
  • 00: neither spaces nor tabs are used
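As a small illustration, the two flags can be turned into the two-character code used throughout the article. The helper name combination is hypothetical; the labels are the ones printed in the author's log output.

```python
# Labels as they appear in the article's log output.
LABELS = {
    "00": "nothing",
    "01": "only tabs",
    "10": "only spaces",
    "11": "both",
}


def combination(has_space, has_tab):
    """Build the code: first digit is the space flag, second is the tab flag."""
    return f"{int(bool(has_space))}{int(bool(has_tab))}"


print(combination(1, 0), LABELS[combination(1, 0)])  # 10 only spaces
```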

The first three options are completely expected. They are what I wanted to find in order to see which one is the most popular. But option 00, "neither tabs nor spaces are used", is something I had not thought about, and yet it turned out that it does happen. "How?" you may ask. Here is an example.

mysql> select releases.release_id, files.file_name, files.size, has_space_beginning, has_tab_beginning from releases join files on releases.release_id = files.release_id and author = 'KOHA';
+------------+---------------------------------------------------+------+---------------------+-------------------+
| release_id | file_name                                         | size | has_space_beginning | has_tab_beginning |
+------------+---------------------------------------------------+------+---------------------+-------------------+
|     118147 | Bundle-KohaSupport-0.31/lib/Bundle/KohaSupport.pm | 2169 |                   0 |                 0 |
|     118147 | Bundle-KohaSupport-0.31/t/Bundle-KohaSupport.t    |  487 |                   0 |                 0 |
|     118147 | Bundle-KohaSupport-0.31/t/pod.t                   |  130 |                   0 |                 0 |
+------------+---------------------------------------------------+------+---------------------+-------------------+
3 rows in set (0.05 sec)

The author KOHA has a release Bundle-KohaSupport-0.31. This release contains 3 files with extensions from the list .pm, .pl, .t or .pod. For all of these files the database says that their indentation contains neither spaces nor tabs. How can this be?

It turns out to be quite simple. If you look at these files, they simply have no indentation. For example, here are the contents of the file t/Bundle-KohaSupport.t:

# Before `make install' is performed this script should be runnable with
# `make test'. After `make install' it should work as `perl Bundle-KohaSupport.t'

#########################

# change 'tests => 1' to 'tests => last_test_to_print';

use Test::More tests => 1;
BEGIN { use_ok('Bundle::KohaSupport') };

#########################

# Insert your test code below, the Test::More module is use()ed here so read
# its man page ( perldoc Test::More ) for help writing this test script.

So in addition to the three expected situations:

  • only spaces are used
  • only tabs are used
  • both spaces and tabs are used

there is also this situation:

  • neither spaces nor tabs are used

Data by author

After the data had been loaded into the database, I decided to look at what each author uses for indentation.

I expected that using only spaces would be the most popular, using only tabs would come second, and using tabs and spaces together would come third.

But it turned out that I was completely wrong.

I wrote a script. It checked which characters each author uses across all the .pm, .pl, .t and .pod files in all of their releases that are currently on CPAN.

Here's what happened:

$ cat app/data/users.log | perl -nalE 'say if /^##/'
## 00 (nothing) - 50 (0.7%)
## 01 (only tabs) - 51 (0.7%)
## 10 (only spaces) - 1543 (21.9%)
## 11 (both) - 5410 (76.7%)

The data is not at all what I expected!

  • More than 75% of the authors use a mixture of spaces and tabs for indentation.
  • Only spaces are in second place, with slightly more than 20%.
  • Authors who use only tabs make up less than one percent.
  • The number of authors who use no indentation at all is almost the same as the number of authors who use only tabs.
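The per-author grouping can be sketched as follows. This Python sketch assumes that an author falls into a category based on the union of the flags of all their files (one file with a tab plus one file with a space gives "both"). That rule is an assumption; the article does not spell out how the original script combines files. The function names are hypothetical.

```python
from collections import Counter, defaultdict

CODES = ("00", "01", "10", "11")


def classify_authors(rows):
    """rows: iterable of (author, has_space, has_tab), one row per file.

    Assumption: an author's flags are the OR of the flags of all files.
    """
    flags = defaultdict(lambda: [0, 0])
    for author, has_space, has_tab in rows:
        flags[author][0] |= has_space
        flags[author][1] |= has_tab
    return {author: f"{s}{t}" for author, (s, t) in flags.items()}


def summarize(categories):
    """Return {code: (number of authors, percent of all authors)}."""
    counts = Counter(categories.values())
    total = sum(counts.values())
    return {
        code: (counts[code], round(100 * counts[code] / total, 1))
        for code in CODES
    }


# File rows taken from the query output shown earlier in the article.
rows = [
    ("RUFF", 1, 1),
    ("RUFF", 0, 0),
    ("KOHA", 0, 0),
    ("KOHA", 0, 0),
    ("KOHA", 0, 0),
]
print(classify_authors(rows))
```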

The full list of all authors, broken down by group, is in a file on GitHub.

And here is the jupyter notebook that was used to build the pie chart.

But this data is built from all the releases that are currently on CPAN. These releases were created over the last 24 years. Perhaps the ratio has been changing over time?

Time

For every release file on CPAN, the modification time is the time when the release was uploaded to CPAN. This data is loaded into the database. The oldest release currently on CPAN is Ioctl-0.5, which was uploaded on 1995-08-20:

mysql> select author, file_name, from_unixtime(timestamp) from releases where timestamp = (select min(timestamp) from releases);
+--------+----------------------------------------------+--------------------------+
| author | file_name                                    | from_unixtime(timestamp) |
+--------+----------------------------------------------+--------------------------+
| KJALB  | /cpan/authors/id/K/KJ/KJALB/Ioctl-0.5.tar.gz | 1995-08-20 07:26:09      |
+--------+----------------------------------------------+--------------------------+
1 row in set (0.08 sec)
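The same conversion from the stored Unix timestamp to a calendar date can be done outside the database. The sketch below assumes UTC; with that assumption the timestamp of Ioctl-0.5 formats to the same value that from_unixtime shows above. The function name upload_date is hypothetical.

```python
from datetime import datetime, timezone


def upload_date(timestamp):
    """Format a Unix timestamp as YYYY-MM-DD (assumption: UTC)."""
    moment = datetime.fromtimestamp(timestamp, tz=timezone.utc)
    return moment.strftime("%Y-%m-%d")


print(upload_date(808903569))  # 1995-08-20
```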

On that day 8 releases were uploaded:

mysql> select * from releases where from_unixtime(timestamp) < '1995-08-21' order by timestamp;
+------------+--------+--------------------------------------------------------------+-----------+
| release_id | author | file_name                                                    | timestamp |
+------------+--------+--------------------------------------------------------------+-----------+
|     112505 | KJALB  | /cpan/authors/id/K/KJ/KJALB/Ioctl-0.5.tar.gz                 | 808903569 |
|      23026 | TYEMQ  | /cpan/authors/id/T/TY/TYEMQ/FileKGlob.tar.gz                 | 808903636 |
|     134031 | WPS    | /cpan/authors/id/W/WP/WPS/Curses-a8.tar.gz                   | 808903647 |
|     112546 | KJALB  | /cpan/authors/id/K/KJ/KJALB/Term-Info-1.0.tar.gz             | 808903748 |
|      70278 | MICB   | /cpan/authors/id/M/MI/MICB/TclTk-b1.tar.gz                   | 808910379 |
|      70274 | MICB   | /cpan/authors/id/M/MI/MICB/Tcl-b1.tar.gz                     | 808910514 |
|      19408 | GBOSS  | /cpan/authors/id/G/GB/GBOSS/perl_archie.1.5.tar.gz           | 808930091 |
|      81551 | JKAST  | /cpan/authors/id/J/JK/JKAST/StatisticsDescriptive-1.1.tar.gz | 808950837 |
+------------+--------+--------------------------------------------------------------+-----------+
8 rows in set (0.06 sec)

I decided to look at how the use of different indentation characters is distributed over time. For this I wrote a script.

Here's a snippet of the data file that the script created:

$ cat app/data/releases_date.csv | head
date,00,01,10,11
1995-08-20,0,1,0,7
1995-08-21,0,0,0,0
1995-08-22,0,0,0,0
1995-08-23,0,0,0,0
1995-08-24,0,0,0,1
1995-08-25,0,0,0,0
1995-08-26,0,0,0,0
1995-08-27,0,0,0,0
1995-08-28,0,0,0,0

That is, for each date starting from 1995-08-20 the file shows how many releases were made, broken down by which characters were used for indentation.

  • 00: the indentation contains neither spaces nor tabs
  • 01: the indentation uses only tabs
  • 10: the indentation uses only spaces
  • 11: the indentation uses both tabs and spaces
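A file of this shape could be produced roughly as follows. This Python sketch takes one (date, code) pair per release and emits one row per calendar day, filling days without releases with zeros, as in the snippet above. How the original script derives a release's code from its files is not described, so the input here is assumed to be already classified. The name daily_rows is hypothetical.

```python
from collections import Counter
from datetime import date, timedelta

CODES = ("00", "01", "10", "11")


def daily_rows(releases):
    """releases: iterable of (iso_date, code), one pair per release.

    Returns rows [date, n00, n01, n10, n11] for every day between the
    first and the last release date, including days with no releases.
    """
    counts = Counter((date.fromisoformat(day), code) for day, code in releases)
    if not counts:
        return []
    days = [day for day, _ in counts]
    first, last = min(days), max(days)
    rows = []
    for offset in range((last - first).days + 1):
        day = first + timedelta(days=offset)
        rows.append([day.isoformat()] + [counts[(day, code)] for code in CODES])
    return rows


# Input chosen to reproduce the first rows of the snippet above.
sample = [("1995-08-20", "01")] + [("1995-08-20", "11")] * 7 + [("1995-08-24", "11")]
print("date," + ",".join(CODES))
for row in daily_rows(sample):
    print(",".join(str(value) for value in row))
```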

Then I wrote a jupyter notebook that drew the chart. On the chart I show the absolute number of releases by indentation type, and the percentage of the total number of releases on that day:

The chart covers almost 9 thousand days. A trend is visible, but the chart is noisy and it is hard to make anything out. So instead of days I grouped the releases by month:
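Grouping the daily rows by month and turning the counts into percentages can be sketched like this. It is an illustration in Python, not the author's notebook; the row format follows the CSV snippet shown earlier, and the name monthly_shares is hypothetical.

```python
from collections import defaultdict

CODES = ("00", "01", "10", "11")


def monthly_shares(daily):
    """daily: rows like ["1995-08-20", 0, 1, 0, 7].

    Returns {"YYYY-MM": {code: percent of that month's releases}}.
    Months with no releases get 0.0 for every code.
    """
    totals = defaultdict(lambda: [0, 0, 0, 0])
    for day, *counts in daily:
        month = day[:7]  # "YYYY-MM"
        for index, number in enumerate(counts):
            totals[month][index] += number
    shares = {}
    for month, counts in totals.items():
        total = sum(counts)
        shares[month] = {
            code: (100 * number / total if total else 0.0)
            for code, number in zip(CODES, counts)
        }
    return shares


daily = [["1995-08-20", 0, 1, 0, 7], ["1995-08-24", 0, 0, 0, 1]]
print(monthly_shares(daily))
```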

And here a surprising trend appears. The share of releases that use only tabs or have no indentation is virtually unchanged, but the share of releases that use only spaces is constantly growing, and this growth comes at the expense of the share of releases that use a mixture of tabs and spaces.

Why is "only spaces" growing? Hypothesis number 1

I looked at the data and came up with a hypothesis about why the number of releases that use both spaces and tabs is decreasing. I thought of the Perl library Module::Install. If you use Module::Install when writing your library, files from that library are included in the CPAN release. And those files use a mixture of spaces and tabs. Here is an example of Module::Install files from the release Devel-PeekPoke-0.04:

mysql> select * from files where release_id = 284 and file_name like '%inc/Module/Install%';
+---------+------------+----------------------------------------------------+-------+---------------------+-------------------+
| file_id | release_id | file_name                                          | size  | has_space_beginning | has_tab_beginning |
+---------+------------+----------------------------------------------------+-------+---------------------+-------------------+
|   10328 |        284 | Devel-PeekPoke-0.04/inc/Module/Install.pm          | 12381 |                   1 |                 1 |
|   10329 |        284 | Devel-PeekPoke-0.04/inc/Module/Install/Metadata.pm | 18111 |                   1 |                 1 |
|   10330 |        284 | Devel-PeekPoke-0.04/inc/Module/Install/Fetch.pm    |  2455 |                   1 |                 1 |
|   10331 |        284 | Devel-PeekPoke-0.04/inc/Module/Install/Makefile.pm | 12063 |                   1 |                 1 |
|   10332 |        284 | Devel-PeekPoke-0.04/inc/Module/Install/Base.pm     |  1127 |                   0 |                 1 |
|   10333 |        284 | Devel-PeekPoke-0.04/inc/Module/Install/WriteAll.pm |  1278 |                   0 |                 1 |
|   10334 |        284 | Devel-PeekPoke-0.04/inc/Module/Install/Win32.pm    |  1795 |                   1 |                 1 |
|   10335 |        284 | Devel-PeekPoke-0.04/inc/Module/Install/Can.pm      |  3183 |                   1 |                 1 |
+---------+------------+----------------------------------------------------+-------+---------------------+-------------------+
8 rows in set (0.03 sec)

My hypothesis is that developers use spaces for indentation, but because the release contains Module::Install, the statistics record both spaces and tabs. Module::Install came to be used less (as Dist::Zilla, Dist::Milla, Minilla and the like appeared), and therefore Module::Install stopped distorting the data.

This hypothesis needs to be checked. First I decided to see whether Module::Install really is being used less and less. I built a chart. Each point is the number of releases in a month that used Module::Install. You can see that this part of the hypothesis is correct: Module::Install is indeed being used less.
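One way to decide whether a release bundles Module::Install is to look for the inc/Module/Install path among its files, which is the same pattern the SQL query above matches with LIKE. The Python sketch below assumes that this path test is sufficient; the article does not say exactly how the author's script makes the decision, and the function name is hypothetical.

```python
def uses_module_install(file_names):
    """True if any file path of the release contains inc/Module/Install.

    Assumption: a bundled copy under inc/ is what marks a release as
    using Module::Install.
    """
    return any("inc/Module/Install" in name for name in file_names)


print(uses_module_install(["Devel-PeekPoke-0.04/inc/Module/Install.pm"]))  # True
print(uses_module_install(["DJabberd-Authen-Dovecot/t/compiles.t"]))       # False
```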

But does the use of Module::Install affect whether a release ends up in "only spaces" or in "tabs and spaces"? To find out, I drew two charts. Each shows the number of releases per month by indentation type. The first chart includes only releases that use Module::Install, the second only releases that do not.

Here you can see that, indeed, if a release uses Module::Install, it most often contains a mixture of tabs and spaces.

And here is the chart that shows only the releases that do not use Module::Install. If you compare this chart with the chart for all releases, there is a difference, but nothing changes fundamentally.

It turns out that the hypothesis is wrong. If a release uses Module::Install, it often falls into the "tabs and spaces" group, but even if you exclude all releases that use Module::Install, the trend remains: the share of releases that use only spaces for indentation grows at the expense of the share of releases that use a mixture of tabs and spaces.

Why is "only spaces" growing? Hypothesis number 2

Why, then, is the number of releases that use only spaces growing? Perhaps there is some overly active author who produces many releases and thereby affects the overall statistics?

I tried to check this. I drew a chart showing the share of releases that use only spaces, broken down by the first letter of the author's name. If some author really made an outsized contribution to the overall statistics, one of the lines would go up very sharply. On the chart I saw, all the lines are more or less even. So I was not able to confirm this hypothesis.
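The data behind such a chart could be prepared along these lines. This is a Python sketch with a hypothetical function name; it assumes each release has already been reduced to an author login, a month and an indentation code, and it computes, for every first letter, the share of "10" (only spaces) releases per month.

```python
from collections import defaultdict


def only_spaces_share_by_letter(releases):
    """releases: iterable of (author, month, code), one per release.

    Returns {first_letter: {month: share of releases with code "10"}}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    only_spaces = defaultdict(lambda: defaultdict(int))
    for author, month, code in releases:
        letter = author[0]
        totals[letter][month] += 1
        if code == "10":
            only_spaces[letter][month] += 1
    return {
        letter: {
            month: only_spaces[letter][month] / number
            for month, number in months.items()
        }
        for letter, months in totals.items()
    }


# Hypothetical input, for illustration only.
example = [
    ("ALICE", "2019-01", "10"),
    ("ADAM", "2019-01", "11"),
    ("BOB", "2019-01", "11"),
]
print(only_spaces_share_by_letter(example))
```

Each letter's values can then be plotted as one line over the months.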

Why is "only spaces" growing? Hypothesis number 3

The charts show that over time there are more and more releases that use only spaces for indentation. And this share is growing at the expense of the releases that use a mixture of spaces and tabs.

My first assumption was that this happens because releases used to include the code of the Module::Install library, which uses a mixture of spaces and tabs, and as this library came to be used less, the share of releases with a mixture of tabs and spaces went down. There turned out to be some truth in this, but even if we exclude all releases that use Module::Install, the overall trend does not change: the share of releases with only spaces still grows at the expense of the share of releases with a mixture of spaces and tabs.

My second assumption was that the statistics are influenced by a very small set of very active authors. I could not find confirmation of this hypothesis.

My third hypothesis is that authors now have more convenient text editors and IDEs in which it is easier to use only spaces rather than a mixture of spaces and tabs. Unfortunately, I have no ideas about how to test this hypothesis. The data on CPAN contains no information about which editor was used to create a release. I looked at the release dates of popular editors and IDEs:

  • Emacs: 1985
  • vim: 1991
  • IntelliJ IDEA: January 2001
  • Eclipse: November 2001
  • Sublime Text: January 2008
  • Atom: February 2014
  • VS Code: April 2015

Data for 2019

The previous chart shows that over time there are more and more releases that use spaces rather than a mix of tabs and spaces. So I decided to look at the distribution of indentation types used by authors, taking into account only their releases from 2019.

Data from the results of running the script:

$ cat app/data/users_2019.log | perl -nalE 'say if /^##/'
## 00 (nothing) - 12 (1.4%)
## 01 (only tabs) - 9 (1.0%)
## 10 (only spaces) - 355 (41.2%)
## 11 (both) - 486 (56.4%)

If we compare the data for 2019 with the data for all years, we see that the percentage of authors who use only tabs has hardly changed, but the proportion of authors who use only spaces has increased dramatically.
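The percentages in both log outputs can be re-derived from the raw author counts. The short Python check below uses only the numbers printed in the article; the function name percentages is hypothetical.

```python
# Author counts copied from the two log outputs in the article.
ALL_YEARS = {"00": 50, "01": 51, "10": 1543, "11": 5410}
YEAR_2019 = {"00": 12, "01": 9, "10": 355, "11": 486}


def percentages(counts):
    """Return each count as a percent of the total, rounded to 0.1."""
    total = sum(counts.values())
    return {code: round(100 * number / total, 1) for code, number in counts.items()}


print(sum(ALL_YEARS.values()), percentages(ALL_YEARS))
print(sum(YEAR_2019.values()), percentages(YEAR_2019))
```

The totals are 7054 authors for all years and 862 authors for 2019, and the share of "only spaces" goes from 21.9% to 41.2%.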

The source for this pie chart:

Factors affecting the validity of the data

All the numbers and charts were built from all the .tar.gz releases that were on CPAN at the time of writing, except for releases of the Perl programming language itself.

CPAN allows releases to be deleted, and deleted releases are not included in the data shown in this article. It is unclear how much the data would change if the indentation characters of already deleted releases were taken into account. It is possible that the data would change a lot. There is an archive, backpan, which stores all the releases that have ever been on CPAN. So in theory it is possible to recalculate all the numbers including the releases that are no longer on CPAN.

The second point that affects data accuracy is that indentation characters were counted only in releases packed as .tar.gz archives. Other archive types were not used. The vast majority of releases are .tar.gz, which is why this simplification was made. If the data were computed for all archives, the numbers would certainly change. I assume the change would be no more than a few percent.

Source code

The whole set of scripts that were used to collect data, the data itself and jupyter notebooks are all available in the repository on GitHub.

The code as written is very far from perfect. Everything was written with the aim of getting a result as quickly as possible, not of creating perfect code.

Summary

At the time of writing, the CPAN repository of Perl libraries contained about 135 thousand releases. The first release was made 24 years ago (1995-08-20). These releases contain almost 4 million files with the extensions .pm, .pl, .t or .pod.

If we consider the data for all time, it turns out that 76.7% of authors use a mixture of spaces and tabs for indentation, 21.9% use only spaces, and 0.7% use only tabs.

But if we consider only the data for 2019, more authors use only spaces for indentation, although the majority still use a mix of tabs and spaces (56.4% use tabs and spaces, 41.2% only spaces, 1.0% only tabs).

And if you look at the chart of how the shares of different indentation types change over time, you can see that the share of "only spaces" is growing, and that it grows at the expense of the share of those who use a mixture of tabs and spaces.

It is not known why this share is growing. It is possible that authors are using more convenient text editors that make it easier and more reliable to set which characters are used for indentation.
