Data inputs and fits
====================

So far we have considered only how to construct a theoretical model of an experiment. To carry out an analysis, at least two more pieces are required: experimental data and uncertainties. Since no real analysis has been done inside the framework yet, the means to work with these kinds of inputs are rather rudimentary, not well thought out for real usage and not even tested. So be ready, and don't hesitate to rewrite or extend anything.

The basic structure of the current design is the following:

- experimental data and uncertainties are just numerical arrays in the form of outputs of a computational graph; even in the simplest case of constant values a computation node (e.g. the ``Points`` class) should be constructed; in more complicated cases some intermediate transformations (like unfolding, extrapolation, etc.) may be performed to get the data to analyze;
- experimental data and uncertainties are associated with a prediction and should have exactly the same shape; this is clearly targeted only at weighted least-squares analysis of binned counting data;
- to make the analysis global, a covariance is defined between pairs of experimental data by using a properly shaped covariance matrix; theoretical covariance is implemented by having common parameters between different experiments;
- the data, uncertainties and covariances associated with the corresponding observables form a *dataset* and are stored inside a ``gna.Dataset`` object; there may be many datasets for the same observations in arbitrary combinations;
- to analyze a set of specified observables, they are concatenated into one linear sequence, as is the data from the provided datasets; covariance matrices are constructed for each subset of non-covariated observables;
- a statistic is constructed from all those subsets and is later passed to a minimization or scanning procedure.

Let's try to see how all of this works on some simple examples. The first example is a simple fit of our Gaussian peak model to hypothetical data, which we'll get from predictions with known parameters and Poisson fluctuations. To get a significant number of events, let's set ``BackgroundRate`` to 10, ``Mu`` to 30 (just arbitrary numbers for illustration) and the number of bins to 10, leaving everything else at the defaults:

.. code-block:: ipython

    python ./gna gaussianpeak --name peak --nbins 10 -- ns --value peak.BackgroundRate 10 --value peak.Mu 30 -- repl
    ...
    In [1]: self.env.get('peak/spectrum').data()
    Out[1]:
    array([  5.08330775,  11.62487551,  11.62487551,   5.08330775,
             5.00000384,   5.        ,   5.        ,   5.        ,
             5.        ,   5.        ])

    In [2]: data = np.random.poisson(self.env.get('peak/spectrum').data())

    In [3]: data
    Out[3]: array([ 3, 11, 14,  2,  4,  3,  8,  3,  5,  1])

    In [4]: np.savez("/tmp/peakdata.npz", data=data)

Here we generate the average theoretical spectrum for the specified parameters and use ``np.random.poisson`` to make a Poisson random sample out of it (of course you'll get different values, since it's random). The data is saved in the ``.npz`` file ``/tmp/peakdata.npz``, which we'll load during the analysis.
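As a quick sanity check, the saved sample can be read back with plain ``numpy``, independently of the framework (the file name is the one used above):

.. code-block:: python

    import numpy as np

    # read back the Poisson-fluctuated sample saved above
    data = np.load("/tmp/peakdata.npz")["data"]
    print(data)  # ten bin counts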
Now we should forget about the parameter values which we used to generate the sample and use our own estimates instead. In our toy example we'll just say that we theoretically estimated ``BackgroundRate`` to be 14 with an uncertainty of 30% and we want to find ``Mu`` by fitting. So, our command will start with the model initialization:

.. code-block:: bash

    python ./gna ns --define peak.BackgroundRate central=14 relsigma=0.3 -- gaussianpeak --name peak --nbins 10

To load the data we can use ``repl`` or put the following commands into a file (for example ``scripts/gaussianpeak_data``):

.. literalinclude:: examples/gaussianpeak_data.py
    :language: py
    :linenos:

Here we just read the data from the created npz file into the variable ``x`` (line 4), create a new dataset with a given description (line 5), then assign ``x`` to be the experimental values of the observable ``peak.spectrum``; since the data is Poissonian, the same ``x`` also goes as the experimental uncertainty (:math:`\sigma^2`, the third argument of ``assign``). Finally, the dataset is stored in ``env`` under the name ``peak_fakedata`` for later usage. This script should be executed with the ``script`` subcommand after the ``gaussianpeak`` initialization:

.. code-block:: bash

    -- script scripts/gaussianpeak_data

Once the dataset is created, we finalize the inputs for our analysis with the following command:

.. code-block:: bash

    -- analysis --name first_analysis --datasets peak_fakedata --observables peak/spectrum

The name of the analysis, which will be used later, is specified, along with the datasets whose data, uncertainties and covariance information will be used, and finally the observables that we are going to analyze. To turn the analysis into an optimization problem, we need a statistic. The only one implemented for now is the chi-squared (weighted least squares) statistic. To use it, the following command should be applied:

.. code-block:: bash

    -- chi2 first_analysis_chi2 first_analysis

where the name of the statistic (``first_analysis_chi2``) and the corresponding analysis object to build the statistic for (``first_analysis``) are specified. We are going to minimize our statistic, so a minimizer should be created:

.. code-block:: bash

    -- minimizer first_analysis_minimizer minuit first_analysis_chi2 peak.Mu peak.BackgroundRate

The arguments are the following: the minimizer name, the minimizer type (only ``minuit`` is implemented; check the modules in the ``gna.minimizers`` package), then the statistic to minimize and finally the list of parameters to minimize over. Here we do a two-dimensional minimization over ``Mu`` and ``BackgroundRate``; all the other parameters are kept fixed at their default values. Now, the last command starts the minimization process:

.. code-block:: bash

    -- fit first_analysis_minimizer

Try to concatenate all the commands into one string and run it to get the following result:

.. code-block:: bash

    Namespace(cpu=0.0068070000000002295, errors=array([ 11.56782544,   4.35315588]), fun=8.597269993657244, maxcv=0.01, nfev=35L, success=True, wall=0.006793022155761719, x=array([ 44.06893665,   5.1389115 ]))

This result is constructed directly from the values returned by MINUIT; the most important fields are the following:

- ``success=True`` -- the minimization procedure (MIGRAD) has converged,
- ``fun=8.597269993657244`` -- the statistic value at the found minimum,
- ``x=array([ 44.06893665,   5.1389115 ])`` -- the minimizing parameter values, in the same order as passed to the ``minimizer`` command,
- ``errors=array([ 11.56782544,   4.35315588])`` -- the corresponding uncertainties estimated by MINUIT.

We have reconstructed our parameters as ``Mu = 44.06893665 +- 11.56782544`` and ``BackgroundRate = 5.1389115 +- 4.35315588``. Not very precise, but that's what we have.
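For reference, the ``fun`` value above is the weighted least-squares statistic built by the ``chi2`` command; schematically (this is the standard definition rather than a quote of the implementation) it reads

.. math::

   \chi^2(\theta) = \bigl(d - t(\theta)\bigr)^T\, V^{-1}\, \bigl(d - t(\theta)\bigr),

where :math:`d` is the concatenated vector of data from the datasets, :math:`t(\theta)` is the corresponding concatenated theoretical prediction for parameter values :math:`\theta` and :math:`V` is the covariance matrix assembled from the uncertainties and covariances of the datasets.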
It's always useful (when possible) to see the profile of the statistic in the form of a contour. To get one, we first need to scan the statistic values over a set of points (i.e. a grid). In our example we can do a two-dimensional scan. First, we need to remove the parameters which we are going to put on the grid from the ``minimizer`` command:

.. code-block:: bash

    -- minimizer first_analysis_minimizer minuit

No parameters are left, but that's fine. This minimizer does not actually minimize anything and just returns the statistic value at the current point, but we need it to pass to the ``scan`` command:

.. code-block:: bash

    -- scan --grid peak.Mu 0 150 0.5 --grid peak.BackgroundRate 0 30 0.5 \
       --minimizer first_analysis_minimizer --verbose --output /tmp/peak_scan.hdf5

The first two arguments define the grids over the two parameters. A few types of grids can be used:

- ``--grid parameter initial-value points-count step-size`` -- constant-step grid specified by the initial value, the number of points and the step size;
- ``--lingrid parameter initial-value final-value points-count`` -- constant-step grid specified by the initial value, the final value and the total number of points, including the initial and final ones;
- ``--loggrid parameter initial-value final-value points-count`` -- constant-step logarithmic (base 10) grid, analogous to ``--lingrid``;
- ``--listgrid parameter value1 value2 ...`` -- arbitrarily spaced grid given by an explicit list of points.

All the grid arguments may be intermixed; the final grid is the Cartesian product over the unions of the grids for each parameter. We also specify ``--minimizer``, which now just returns the statistic value, the ``--verbose`` flag, which is not required but prints the values for each point during the scan, and finally the output file.

When the scan has done its work, you'll have the output file ``/tmp/peak_scan.hdf5``. You can explore it with ``hdf-java`` or ``h5dump``. It has a so-called point-tree structure: for :math:`n` parameters the first :math:`n` levels of the hierarchy represent the parameter values, the name of the :math:`i`-th level group being the value of the :math:`i`-th parameter. The ordered list of the parameters is stored in the ``params`` root attribute. On the :math:`(n+1)`-th level we have the actual data -- the contents depend on the concrete scan type, but generally there is a ``datafit`` dataset where the result for each specified minimizer is stored and ``fcparams`` where the minimizing parameters are stored (in the order of the ``allparams`` root attribute). Just take a look and you'll see.

Finally, let's plot the contour. To do that you need to bring back the minimizer with all the parameters (it will be used to find the global minimum) and issue the following command:

.. code-block:: bash

    -- contour --chi2 /tmp/peak_scan.hdf5 --plot chi2ci 1s 2s --minimizer first_analysis_minimizer --show

The first argument is a path to the map of chi-squared values, the second specifies what we are going to plot -- ``chi2ci`` means a chi-squared confidence interval, corresponding to the p-value given by the :math:`\chi^2(n)` distribution. ``1s`` and ``2s`` mean one and two sigma; these are mapped to the p-values of the two-tailed Gaussian. The ``--show`` argument asks ``matplotlib`` to show the plot on the screen. If you like the plot, you can save it to a file with the ``--output path`` argument.

In our two-dimensional fit we actually left ``BackgroundRate`` completely free; no information about its uncertainty was used. If we want to take it into account we have two options: either add it as a pull (penalty term) or add the corresponding covariance to the covariance matrix.
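Schematically, the two options relate as follows (this is the standard construction, not a quote of the framework code). The pull option adds a penalty term for the nuisance parameter :math:`b` (here ``BackgroundRate``) with central value :math:`b_0` and uncertainty :math:`\sigma_b`:

.. math::

   \chi^2_{\mathrm{pull}}(\theta, b) = \chi^2(\theta, b) + \frac{(b - b_0)^2}{\sigma_b^2}.

The covariance option instead removes :math:`b` from the minimization and propagates its uncertainty into the covariance matrix,

.. math::

   V \to V + \sigma_b^2 \, \frac{\partial t}{\partial b} \left(\frac{\partial t}{\partial b}\right)^{T}.

For a model that is linear in :math:`b`, as ours is, the two formulations lead to the same minimum, which is why the fits below are expected to agree.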
The pull parameters in the framework are handled just like any real experimental data -- the observable for them is the parameter value itself, the data is the central value and the uncertainty is the sigma specified in the parameter definition. Therefore, to add penalty terms to the fit, we need to:

1. Create a new dataset for the pull parameters; there is a short command for that (check the code):

   .. code-block:: bash

      -- dataset --name pulls --pull peak.BackgroundRate

   Here just one parameter is added to the dataset ``pulls`` (the name is arbitrary); any number can be specified.

2. Add the created dataset to the ``--datasets`` argument of the ``analysis`` command.

3. Add all the pull parameters to the ``--observables`` argument of the ``analysis`` command.

Try to construct the command yourself. Or here is the complete one for reference:

.. code-block:: bash

    python ./gna ns --define peak.BackgroundRate central=14 relsigma=0.3 \
        -- gaussianpeak --name peak --nbins 10 \
        -- script scripts/gaussianpeak_data \
        -- dataset --name pulls --pull peak.BackgroundRate \
        -- analysis --name first_analysis --dataset peak_fakedata pulls \
           --observables peak/spectrum peak.BackgroundRate \
        -- chi2 first_analysis_chi2 first_analysis \
        -- minimizer first_analysis_minimizer minuit first_analysis_chi2 peak.Mu \
           peak.BackgroundRate \
        -- fit first_analysis_minimizer

It gives the following result:

.. code-block:: bash

    Namespace(cpu=0.0049440000000000595, errors=array([ 11.54509995,   1.10426956]), fun=12.74075442086563, maxcv=0.01, nfev=31L, success=True, wall=0.004951953887939453, x=array([ 42.59368222,   5.75145748]))

Just a bit better than the previous one, but that's expected, since the uncertainty is just 30%.

Alternatively, you can follow the covariance matrix approach. Since our theoretical model is linear with respect to ``BackgroundRate``, the result should be exactly the same as in the approach with pull terms. Let's try. What's required is just to add the parameter to the ``--parameters`` argument of ``analysis``, exclude it from the minimization and drop the ``pulls`` dataset:

.. code-block:: bash

    python ./gna ns --define peak.BackgroundRate central=14 relsigma=0.3 \
        -- gaussianpeak --name peak --nbins 10 \
        -- script scripts/gaussianpeak_data \
        -- analysis --name first_analysis --dataset peak_fakedata \
           --observables peak/spectrum --parameters peak.BackgroundRate \
        -- chi2 first_analysis_chi2 first_analysis \
        -- minimizer first_analysis_minimizer minuit first_analysis_chi2 peak.Mu \
        -- fit first_analysis_minimizer

And the result is:

.. code-block:: bash

    Namespace(cpu=0.005394999999998262, errors=array([ 11.54509989]), fun=12.740754420865585, maxcv=0.01, nfev=13L, success=True, wall=0.005403995513916016, x=array([ 42.59368211]))

The results match (up to the numerical precision of the minimization), as expected, so everything works fine. Of course, if you have more than one nuisance parameter, you can freely mix both approaches. Just don't apply both to the same parameter -- it doesn't make sense.
Let's do one-dimensional scans and plots for both approaches. Hopefully, you have an idea of how to do it on the basis of the previous commands. Here are the commands for scanning with the pull:

.. code-block:: bash

    python ./gna ns --define peak.BackgroundRate central=14 relsigma=0.3 \
        -- gaussianpeak --name peak --nbins 10 \
        -- script scripts/gaussianpeak_data \
        -- dataset --name pulls --pull peak.BackgroundRate \
        -- analysis --name first_analysis --dataset peak_fakedata pulls \
           --observables peak/spectrum peak.BackgroundRate \
        -- chi2 first_analysis_chi2 first_analysis \
        -- minimizer first_analysis_minimizer minuit first_analysis_chi2 peak.BackgroundRate \
        -- scan --grid peak.Mu 0 150 0.5 --minimizer first_analysis_minimizer \
           --verbose --output /tmp/peak_scan_1d_pulls.hdf5

and with the covariance:

.. code-block:: bash

    python ./gna ns --define peak.BackgroundRate central=14 relsigma=0.3 \
        -- gaussianpeak --name peak --nbins 10 \
        -- script scripts/gaussianpeak_data \
        -- analysis --name first_analysis --dataset peak_fakedata \
           --observables peak/spectrum --parameters peak.BackgroundRate \
        -- chi2 first_analysis_chi2 first_analysis \
        -- minimizer first_analysis_minimizer minuit first_analysis_chi2 \
        -- scan --grid peak.Mu 0 150 0.5 --minimizer first_analysis_minimizer \
           --verbose --output /tmp/peak_scan_1d_covariance.hdf5

The plotting commands also differ, because different minimizers are used for the global minimization:

.. code-block:: bash

    python ./gna ns --define peak.BackgroundRate central=14 relsigma=0.3 \
        -- gaussianpeak --name peak --nbins 10 \
        -- script scripts/gaussianpeak_data \
        -- dataset --name pulls --pull peak.BackgroundRate \
        -- analysis --name first_analysis --dataset peak_fakedata pulls \
           --observables peak/spectrum peak.BackgroundRate \
        -- chi2 first_analysis_chi2 first_analysis \
        -- minimizer first_analysis_minimizer minuit first_analysis_chi2 peak.Mu \
           peak.BackgroundRate \
        -- contour --chi2 /tmp/peak_scan_1d_pulls.hdf5 --plot chi2ci 1s 2s \
           --minimizer first_analysis_minimizer --show

and:

.. code-block:: bash

    python ./gna ns --define peak.BackgroundRate central=14 relsigma=0.3 \
        -- gaussianpeak --name peak --nbins 10 \
        -- script scripts/gaussianpeak_data \
        -- analysis --name first_analysis --dataset peak_fakedata \
           --observables peak/spectrum --parameters peak.BackgroundRate \
        -- chi2 first_analysis_chi2 first_analysis \
        -- minimizer first_analysis_minimizer minuit first_analysis_chi2 peak.Mu \
        -- contour --chi2 /tmp/peak_scan_1d_covariance.hdf5 --plot chi2ci 1s 2s \
           --minimizer first_analysis_minimizer --show

I'm sorry that the command lines are so verbose; please fix that whenever possible.

And just a few words about combined analysis. You can add as many datasets, observables and parameters as you want, but this will clearly need a better interface than the current one. You can also add correlations between experiments inside one dataset; the API is very simple:

.. code-block:: python

    dataset.covariate(observable1, observable2, covariation_matrix)

The covariance matrix should be properly shaped (:math:`m \times n`, where :math:`m` and :math:`n` are the dimensions of ``observable1`` and ``observable2`` respectively). ``covariation_matrix`` may be just a numpy array with numbers or a computational graph output. All applicable covariances from the specified datasets will be used by ``analysis``. This is not well tested though, so expect a lot of surprises and debugging. The only usage so far is in ``gna.ui.oldreactors``, please refer to it.
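As an illustration only, here is how such a cross-covariance block could be built with plain ``numpy`` for a fully correlated relative normalization uncertainty shared by two predictions. The predictions and the 10% uncertainty are made up for the example, and the commented-out ``covariate`` call refers to the API shown above (``dataset``, ``observable1`` and ``observable2`` are assumed to exist already, e.g. created in a data-loading script like the one before):

.. code-block:: python

    import numpy as np

    # hypothetical predictions for two observables with m and n bins
    pred1 = np.array([5.1, 11.6, 11.6, 5.1, 5.0])        # m = 5 bins
    pred2 = np.array([4.9, 10.8, 10.8, 4.9, 5.0, 5.0])   # n = 6 bins

    # assumed common 10% normalization uncertainty, fully correlated
    relsigma = 0.1

    # (m x n) cross-covariance block: cov[i, j] = relsigma^2 * pred1[i] * pred2[j]
    cov = relsigma**2 * np.outer(pred1, pred2)

    # dataset.covariate(observable1, observable2, cov)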