Distribution Upload HOWTO

From ISP_RAS

How To Collect And Upload Distribution Data

There are two ways to collect distribution data:

  1. Using an installed distribution.
  2. Using the set of rpm or deb packages that make up the distribution.

This document covers only the second way. It is preferable because it does not require the distribution to be installed. The first way requires every package whose data is to be uploaded to be installed (usually ~1000 packages, though the number varies between systems); this takes a lot of time, so we do not recommend it. (However, see step 6 in the algorithm below.)

The Data Collection Algorithm

The following steps should be performed to collect data using deb or rpm packages:

  • Call the AnalyzePkg.pl script to detect which packages we are interested in. Provide the script with the path where the distribution packages are located using the '-p' or '--path' option:
        ./AnalyzePkg.pl -p /path/to/distribution/packages
    In case of deb packages, call the script with '-d' option:
        ./AnalyzePkg.pl -p /path/to/distribution/deb/packages -d
    The script will create directory PKG with files containing analysis results.
    In addition, two files called 'perl_component_list' and 'python_component_list' will be created containing data about perl and python packages.
    Note: Mounted media (CD or DVD) can be used as the source of packages, but data collection is faster when all packages are located on a hard drive (simply because file access and copy speed is higher), so it is preferable to use an iso image mounted from the hard drive.
  • Call distrtodb.pl to actually collect the data. For rpm packages, call the script with the '-r' option; for deb packages, use '-d'. You should also specify the distribution architecture using the '-a' option:
        ./distrtodb.pl -r -a x86 DistrName DistrVersion PKG/component_list_sorted 2>distrtodb_errors
    The script takes three arguments: the distribution name, the version, and the list of packages to be uploaded (the list is created automatically by AnalyzePkg.pl, but you may use your own if needed).
  • NOTE: When collecting data, distrtodb.pl analyzes postinstall scripts and tries to detect modifications to the ldconfig search paths (that is, modifications of the /etc/ld.so.conf file, or of the content of files installed in the /etc/ld.so.conf.d/ directory). The script cannot handle every possible situation; when in doubt, it prints 'unrecognized ld/so/conf modification' messages to STDERR. Ideally, all such messages should be analyzed (that is why STDERR is redirected to a separate file, distrtodb_errors, in the example above) and the necessary '!LDpath' records added to the collected data. Examples of these records can be found in any file with distribution data.
  • If everything is ok, a file with distribution data should be created. The file is named '<DistrName>_<DistrVersion>_<Arch>_upload_data' by default.
    Before proceeding to the next step, rename this file or save it to some other location:
        mv <DistrName>_<DistrVersion>_<Arch>_upload_data <DistrName>_<DistrVersion>_<Arch>_upload_data_main
  • Now let's collect data about interpreted languages. Because of the algorithms used in the upload tools, this data cannot be placed in the file created in step 2. That is why we renamed that file in step 3; otherwise it would be overwritten in the next steps.
    All we need to do now is call distrtodb.pl once more with different input files:
        ./distrtodb.pl -r -a x86 DistrName DistrVersion perl_component_list
    Again, '<DistrName>_<DistrVersion>_<Arch>_upload_data' will be created. We recommend renaming it to something more descriptive:
        mv <DistrName>_<DistrVersion>_<Arch>_upload_data <DistrName>_<DistrVersion>_<Arch>_upload_data_perl
    Note: Do NOT merge this file with the file created in step 2.
    And now python:
        ./distrtodb.pl -r -a x86 DistrName DistrVersion python_component_list
        mv <DistrName>_<DistrVersion>_<Arch>_upload_data <DistrName>_<DistrVersion>_<Arch>_upload_data_python
    We do not recommend merging data for different languages, since a different collection algorithm is used for every language, and whenever an algorithm is modified the corresponding data has to be recollected. If data for different languages is placed in one file, you will likely have to recollect the data for all of them.
  • (OPTIONAL). There are some built-in python modules whose presence can be checked only in an installed system (with python installed, of course). The absence of information about these modules in the database is not considered a critical issue at the moment, so this step is optional.
    To obtain this data, execute distrtodb.pl in the installed system with the '--dump_python' option:
        touch dummy_list && ./distrtodb.pl --dump_python -r -a x86 DistrName DistrVersion dummy_list && rm dummy_list
  • Let's now collect Java data. First, determine which Java machines are provided with the distribution. It is strongly recommended that all packages belonging to the same Java machine form one distribution component. Suppose that we have Sun Java consisting of the packages jre.rpm and jdk.rpm, and IBM Java consisting of ibm_pack1.rpm and ibm_pack2.rpm. Create a file named 'java_list' with the following contents:
    SunJava jre.rpm
    SunJava jdk.rpm
    IBMJava ibm_pack1.rpm
    IBMJava ibm_pack2.rpm
    Note: Component names should not contain spaces.
    When the file is ready, one should call 'distrtodb.pl' with '--java' option:
        ./distrtodb.pl --java -r -a x86 DistrName DistrVersion java_list
        mv <DistrName>_<DistrVersion>_<Arch>_upload_data <DistrName>_<DistrVersion>_<Arch>_upload_data_java
    Now we have a file with data about Java virtual machines in our distribution.
    Note: Java analysis is rather time-consuming - java analyzers are currently much slower than readelf.
  • Now we have several '*upload_data' files. They are the final result of the data collection process; these files should now be processed by the upload_distr_data.pl script.
  • In order to reduce the size of data stored in the database, we use 'aliases' for Components. Two components are called 'aliases' if they have exactly the same content (libraries, commands, interfaces, etc.). If two components are found to be aliases, there is no need to store information about content of every component; we store this data only for one of them (let's say with Cid=Cid1), and for others we set Calias field to Cid1.

Note that there should be no 'aliases for aliases', i.e. if some db record has Calias=Cid1, then the Component record with Cid=Cid1 should have Calias=0.
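
The 'no aliases for aliases' rule can be checked mechanically. The following is a minimal sketch, assuming that the Cid and Calias columns of the Component table have been exported to a plain text file with one 'Cid Calias' pair per line. The export format and the function name check_alias_chains are assumptions made for illustration and are not part of the upload tools.

```shell
# check_alias_chains FILE
# FILE holds one "Cid Calias" pair per line (hypothetical export format).
# Prints every Cid whose alias target is missing or is itself an alias,
# and returns non-zero if at least one such record is found.
check_alias_chains() {
    awk '
        NF >= 2 { calias[$1] = $2 }          # remember Calias for every Cid
        END {
            bad = 0
            for (cid in calias) {
                target = calias[cid]
                if (target + 0 == 0)         # Calias=0: not an alias
                    continue
                if (!(target in calias)) {
                    print cid ": alias target " target " not found"
                    bad = 1
                } else if (calias[target] + 0 != 0) {
                    print cid ": alias target " target " is itself an alias"
                    bad = 1
                }
            }
            exit bad
        }
    ' "$1"
}
```

Calling check_alias_chains on such an export prints nothing and returns zero when every alias points directly at a component with Calias=0.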

Aliases can be detected automatically either by a perl script (detect_distr_comp_aliases.pl) or by the corresponding MySQL stored procedure (detect_aliases_for_distr). Both the script and the procedure take the Distribution.Did value as a parameter. The script produces a set of SQL instructions that should then be applied to the db, whereas the procedure performs all necessary actions 'in place'.

Alternatively, you can specify '--alias' option for upload_distr_data.pl.

Note also that most 64bit distributions provide both 32bit and 64bit versions of some libraries. Sometimes these versions are provided by packages that differ only by architecture suffix (e.g. libfoo.i586.rpm and libfoo.x86-64.rpm). During data collection and upload, such packages are joined into one component (named libfoo). However, it can be useful to divide this component into 32bit and 64bit parts, since it is very likely that the 32bit part will be an alias for the same package from the 32bit version of the distribution. Separate scripts (detect_biarch_comps*.pl) exist for this purpose. We suggest running them before setting up aliases.
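
To illustrate which packages end up in such a joined component, here is a minimal sketch that lists the names occurring with more than one architecture suffix. It assumes file names of the form '<name>.<arch>.rpm', as in the libfoo example above; the function names are hypothetical, and this is not how detect_biarch_comps*.pl are implemented. With real package names the part before the architecture also contains the version and release, so the grouping only matches packages of the same version.

```shell
# component_name FILE
# Strip the directory, the ".rpm" extension and the architecture suffix.
component_name() {
    base=${1##*/}                 # drop the directory part
    base=${base%.rpm}             # drop the extension
    printf '%s\n' "${base%.*}"    # drop the last dot-separated field (the arch)
}

# biarch_components
# Read package file names on stdin and print every component name that
# occurs with more than one architecture suffix.
biarch_components() {
    while IFS= read -r pkg; do
        [ -n "$pkg" ] || continue
        base=${pkg##*/}
        base=${base%.rpm}
        # emit "name arch" pairs
        printf '%s %s\n' "${base%.*}" "${base##*.}"
    done | LC_ALL=C sort -u |
        awk '{ n[$1]++ } END { for (c in n) if (n[c] > 1) print c }' |
        LC_ALL=C sort
}
```

For example, 'ls /path/to/distribution/packages | biarch_components' would list the candidates for splitting into 32bit and 64bit parts.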


How does this work?

Now some words about algorithms used by the data collection tools.

The AnalyzePkg.pl script detects which packages provide data we are interested in. We don't collect data about all distribution components; instead, we keep a list of shared library sonames in the 'library_list' file, and for every distribution we collect data about the components that provide libraries with these sonames. The list is built on the basis of application needs: if we see that applications use some library, that library should be added to the list.
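
The selection idea can be sketched as follows. This is only an illustration under stated assumptions: 'library_list' is taken to contain one soname per line, and the sonames provided by each package are assumed to have been extracted beforehand into a hypothetical 'provides' file with one 'package soname' pair per line. The actual logic of AnalyzePkg.pl may differ.

```shell
# select_packages LIBRARY_LIST PROVIDES
# LIBRARY_LIST: one soname per line (assumed format).
# PROVIDES: "package soname" pairs, one per line (hypothetical input).
# Prints the sorted, de-duplicated names of packages that provide at
# least one soname from LIBRARY_LIST.
select_packages() {
    awk '
        FILENAME == ARGV[1] { if (NF) wanted[$1] = 1; next }  # sonames of interest
        NF >= 2 && ($2 in wanted) { print $1 }                # matching packages
    ' "$1" "$2" | LC_ALL=C sort -u
}
```

A package that provides only libraries missing from the list produces no output and is therefore skipped.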

Note: The list of libraries implements the 'approved libraries' concept; however, the ApprovedLibrary table in the database and the 'library_list' file have different meanings. 'library_list' should contain all sonames from the ApprovedLibrary table, but not vice versa. For entries from ApprovedLibrary we guarantee data completeness, i.e. if some soname is present in this table, users can be sure that we have looked for libraries with this soname in every uploaded distribution. So whoever adds a new entry to the 'library_list' file should collect data about the new library for all distributions already present in the database. Only after that can the entry be added to the ApprovedLibrary table.
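
The requirement that 'library_list' contains every soname from the ApprovedLibrary table can be verified with a simple set difference. The sketch below assumes that the sonames from ApprovedLibrary have been exported to a text file with one soname per line and that 'library_list' uses the same layout; both the export and the function name are assumptions.

```shell
# missing_from_library_list APPROVED LIBRARY_LIST
# APPROVED: sonames exported from the ApprovedLibrary table, one per line
# (hypothetical export). LIBRARY_LIST: one soname per line (assumed format).
# Prints every approved soname that is absent from LIBRARY_LIST.
missing_from_library_list() {
    awk '
        FILENAME == ARGV[1] { if (NF) have[$1] = 1; next }  # sonames in library_list
        NF && !($1 in have) { print $1 }                    # approved but not listed
    ' "$2" "$1"
}
```

Empty output means the two lists are consistent. Entries that appear only in 'library_list' are not reported, since that direction is allowed.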

We also have a similar list of commands stored in the 'command_list' file. This is simply a list of all commands included in the LSB (to tell the truth, the list also contains a few more related commands).

We don't have any analogues of 'approved' lists for perl or python; we simply take all files that have 'perl' or 'py' in their names, ignoring some known exceptions (stored in the ignore_perl and ignore_python files respectively).
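
A name-based filter of this kind can be sketched in a few lines. The example assumes that the ignore files contain one name per line and that a name is ignored only on an exact match; the real format of ignore_perl and ignore_python may be different, and filter_by_name is a hypothetical helper.

```shell
# filter_by_name SUBSTRING IGNORE_FILE
# Reads names on stdin and prints those that contain SUBSTRING and are
# not listed in IGNORE_FILE (one name per line, exact match assumed).
filter_by_name() {
    awk -v pat="$1" '
        FILENAME == ARGV[1] { if (NF) ignore[$0] = 1; next }  # load exceptions
        index($0, pat) && !($0 in ignore)                     # print the rest
    ' "$2" -
}
```

For example, 'ls /path/to/distribution/packages | filter_by_name perl ignore_perl' keeps every name containing 'perl' except the listed exceptions.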

Detailed description of distrtodb.pl can be found in the README_collect file or on the Collect Distribution Information page.
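
For quick reference, the repeated 'run distrtodb.pl, then rename the result' pattern from the data collection algorithm can be wrapped in a small shell helper. This is a sketch, not part of the tools: the function names, the DISTRTODB variable and the per-run error file name are assumptions, and it presumes that <Arch> in the default output name equals the value passed to '-a'.

```shell
# Path to the collection script; override it if the script lives elsewhere.
DISTRTODB=${DISTRTODB:-./distrtodb.pl}

# upload_data_name NAME VERSION ARCH
# Prints the default output file name described in the algorithm.
upload_data_name() {
    printf '%s_%s_%s_upload_data\n' "$1" "$2" "$3"
}

# collect_one SUFFIX PKGFLAG ARCH NAME VERSION LIST
# Runs one collection pass (PKGFLAG is -r or -d) and renames the result
# to <default name>_SUFFIX so that the next pass cannot overwrite it.
# STDERR goes to distrtodb_errors_SUFFIX (hypothetical naming).
collect_one() {
    suffix=$1 pkgflag=$2 arch=$3 name=$4 version=$5 list=$6
    out=$(upload_data_name "$name" "$version" "$arch")
    "$DISTRTODB" "$pkgflag" -a "$arch" "$name" "$version" "$list" \
        2> "distrtodb_errors_$suffix" || return 1
    if [ ! -f "$out" ]; then
        echo "expected file $out was not created" >&2
        return 1
    fi
    mv "$out" "${out}_$suffix"
}

# Example calls, mirroring the steps of the algorithm:
#   collect_one main   -r x86 DistrName DistrVersion PKG/component_list_sorted
#   collect_one perl   -r x86 DistrName DistrVersion perl_component_list
#   collect_one python -r x86 DistrName DistrVersion python_component_list
```

Keeping one output file and one error file per pass matches the recommendation not to merge data for different languages.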
