
I've been running the following piece of Python code using the easyaccess package:

def se_object_query(args):

    import easyaccess as ea 
    import numpy as np 

    if args.verbose > 0:
        print()
        print('* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *')
        print('se_object_query')
        print('* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *')
        print()

    connection = ea.connect('desoper')  # open a connection to the DES 'desoper' database

    query="""
    with x as (
        select /*+materialize */ pfw_attempt_id
            from prod.proctag
            where tag='%s'
        ),
    y as (
        select /*+materialize */ a.expnum as expnum ,max(a.lastchanged_time) as evaltime
            from prod.firstcut_eval a
            where a.analyst!='SNQUALITY'
            group by a.expnum
          ),
    z as (
        select /*+materialize */ b.expnum
            from prod.firstcut_eval b, y
            where b.expnum=y.expnum
            AND b.lastchanged_time=y.evaltime
            AND b.accepted='True'
        )
    SELECT c.expnum, c.filename,
           substr(o.band,1,1) as BAND,
           c.ccdnum as CCDNUM,
           o.object_number, o.x_image, o.y_image, o.ra, o.dec,
           o.flux_psf, o.fluxerr_psf,
           3600.*o.fwhm_world as fwhm_arcsec,
           o.class_star, o.spread_model, o.spreaderr_model, o.flags
    FROM   prod.se_object o, prod.catalog c, x, z   
    WHERE x.pfw_attempt_id=c.pfw_attempt_id
        AND c.filetype='cat_finalcut'
        AND c.expnum=z.expnum
        AND o.filename=c.filename
        AND ( (o.class_star > 0.8) OR ((o.spread_model + 3.*o.spreaderr_model) between -0.003 AND 0.003) )
        AND o.flux_psf > 2000.
        AND o.flags = 0"""  % (args.tag)

    if args.verbose > 0: print(query)

    outputFiles="""%s.objinfo.csv""" % (args.baseName)

    connection.query_and_save(query,outputFiles)

    connection.close()

    if args.verbose > 0: print()

    return 0

 

It's a big query.  The query itself seems to run quite fast for its size, but the download speeds are quite variable (ranging from a few hours to many, many hours), at least when using CSV.  I am also now testing downloading to a FITS table instead; I've heard that downloading to FITS tends to be faster, but I was wondering whether Matias Carrasco Kind or Felipe Menanteau or anyone else might have further suggestions.  Is the "query_and_save" method the way to go?  Or are there faster or more recommended ways of downloading a query result to a file?  (The FITS variant I am testing is sketched just below.)
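For the FITS test I am only changing the output filename; a minimal sketch, assuming query_and_save picks the output format from the file extension (the .fits name here is only illustrative):

    # Same query as above, but asking query_and_save for FITS output instead of CSV.
    # Assumption: easyaccess infers the output format from the file extension.
    outputFiles = """%s.objinfo.fits""" % (args.baseName)
    connection.query_and_save(query, outputFiles)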

Thanks!

Best regards,

Douglas

 

  1. Douglas Tucker

    For example, see the timestamps for the files created by running the above query on Y2N_FIRSTCUT overnight (and it is still running 23 hours later!):

     

    bash-4.1$ ls -Flthr rawdata.y2n.objinfo*
    -rw-rw-r-- 1 dtucker sdss 2.0G Feb  4 16:15 rawdata.y2n.objinfo_000001.csv
    -rw-rw-r-- 1 dtucker sdss 2.0G Feb  4 17:30 rawdata.y2n.objinfo_000002.csv
    -rw-rw-r-- 1 dtucker sdss 2.0G Feb  4 18:36 rawdata.y2n.objinfo_000003.csv
    -rw-rw-r-- 1 dtucker sdss 2.0G Feb  4 19:39 rawdata.y2n.objinfo_000004.csv
    -rw-rw-r-- 1 dtucker sdss 2.0G Feb  4 20:50 rawdata.y2n.objinfo_000005.csv
    -rw-rw-r-- 1 dtucker sdss 2.0G Feb  4 22:35 rawdata.y2n.objinfo_000006.csv
    -rw-rw-r-- 1 dtucker sdss 2.0G Feb  5 00:59 rawdata.y2n.objinfo_000007.csv
    -rw-rw-r-- 1 dtucker sdss 2.0G Feb  5 04:20 rawdata.y2n.objinfo_000008.csv
    -rw-rw-r-- 1 dtucker sdss 2.0G Feb  5 09:26 rawdata.y2n.objinfo_000009.csv
    -rw-rw-r-- 1 dtucker sdss 1.8G Feb  5 15:10 rawdata.y2n.objinfo_000010.csv

     

    Note in particular that the first few "chunks" (CSV files) each take about 1h15m to output, but the time increases for the later chunks; e.g., rawdata.y2n.objinfo_000009.csv took 5 hours to output.

     

  2. Douglas Tucker

    And here are the timestamps for running the query on Y3_FIRSTCUT overnight (and it is still running):

    bash-4.1$ ls -Flthr rawdata.y3n.objinfo*
    -rw-rw-r-- 1 dtucker sdss 2.0G Feb  4 15:25 rawdata.y3n.objinfo_000001.csv
    -rw-rw-r-- 1 dtucker sdss 2.0G Feb  4 15:37 rawdata.y3n.objinfo_000002.csv
    -rw-rw-r-- 1 dtucker sdss 2.0G Feb  4 15:50 rawdata.y3n.objinfo_000003.csv
    -rw-rw-r-- 1 dtucker sdss 2.0G Feb  4 16:03 rawdata.y3n.objinfo_000004.csv
    -rw-rw-r-- 1 dtucker sdss 2.0G Feb  4 16:16 rawdata.y3n.objinfo_000005.csv
    -rw-rw-r-- 1 dtucker sdss 2.0G Feb  4 16:31 rawdata.y3n.objinfo_000006.csv
    -rw-rw-r-- 1 dtucker sdss 2.0G Feb  4 16:47 rawdata.y3n.objinfo_000007.csv
    -rw-rw-r-- 1 dtucker sdss 2.0G Feb  4 17:03 rawdata.y3n.objinfo_000008.csv
    -rw-rw-r-- 1 dtucker sdss 2.0G Feb  4 17:26 rawdata.y3n.objinfo_000009.csv
    -rw-rw-r-- 1 dtucker sdss 1.8G Feb  5 13:08 rawdata.y3n.objinfo_000010.csv

     

    That last file has been dribbling across for 22 hours!  The same query a few days ago took only 7 hours to run and download all the files (if the output is the same as a few days ago, there should be 11 files in total, so I am waiting for one more file beyond rawdata.y3n.objinfo_000010.csv).

     

  3. David Gerdes

    I would think that the time for a complex query to complete is driven by database latency, not by the output file format. Just guessing, though. For what it's worth, I'm fond of the query_to_pandas() method, followed by to_csv() if I want to save the dataframe to a file (roughly as in the sketch below).
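    A minimal sketch of that pattern, assuming the same query string as in the original post (the output filename is only illustrative):

        import easyaccess as ea

        connection = ea.connect('desoper')
        df = connection.query_to_pandas(query)   # run the query, get a pandas DataFrame back
        df.to_csv('objinfo.csv', index=False)    # write the DataFrame to a single CSV file
        connection.close()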

  4. Douglas Tucker

    Thanks, David Gerdes!  I have only recently been introduced to pandas, but I admit I really like what I have seen.  My concern about query_to_pandas() is that I am querying for multiple gigabytes of data, which won't all fit in RAM (weeeellll, maybe on des40.fnal.gov, but I am trying to avoid being too greedy!).  Can query_to_pandas() deal with multiple gigabytes of data?  Thanks!

     


2 answers

  1.

    I agree that, from what I've seen, HDF5 is fastest. For writing FITS there is an extra conversion from the pandas DataFrame into a numpy recarray before writing:

    https://github.com/mgckind/easyaccess/blob/b9585ba4d645336a4e42321bb7d60ad36103da3d/easyaccess/easyaccess.py#L244-L272

    However, the writing routine itself is quite different, and I haven't actually timed anything. (A rough sketch of that conversion step is below.)
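    The conversion in question looks roughly like this; a sketch only, using the fitsio package (easyaccess's actual implementation is at the link above, and the DataFrame here is just toy data):

        import fitsio
        import pandas as pd

        # Toy DataFrame standing in for a real query result.
        df = pd.DataFrame({'expnum': [228752, 228753], 'flux_psf': [2500.0, 3100.0]})

        # Extra step mentioned above: convert the DataFrame to a numpy recarray...
        rec = df.to_records(index=False)

        # ...then write the recarray out as a FITS binary table.
        fitsio.write('objinfo.fits', rec, clobber=True)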

    1. Douglas Tucker

      Thanks, Alex! 

      I haven't used HDF5 before.  I will google around, but do you recommend any particular packages for dealing with HDF5 and/or converting HDF5 to other formats (e.g., CSV, FITS)?  Thanks!

    2. Matias Carrasco Kind

      There are several packages out there for dealing with HDF5; easyaccess uses h5py (http://www.h5py.org/), and pandas offers a good interface between HDF5 and CSV (a quick example is below). For the FITS-HDF5 conversion there are also several packages, which may depend on your preferred language. I haven't used any of them myself, as I usually do the conversion by hand using fitsio, but here is one possible option: http://fits2hdf.readthedocs.org/en/latest/.
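      A minimal pandas round-trip sketch (pandas handles HDF5 via PyTables under the hood; the filenames and key are only illustrative):

          import pandas as pd

          # CSV -> HDF5: read an existing CSV chunk and store it as an HDF5 table.
          df = pd.read_csv('rawdata.objinfo_000001.csv')
          df.to_hdf('rawdata.objinfo_000001.h5', key='data', mode='w')

          # HDF5 -> CSV: read it back and write it out as CSV again.
          df2 = pd.read_hdf('rawdata.objinfo_000001.h5', key='data')
          df2.to_csv('rawdata.objinfo_check.csv', index=False)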

  2.

    My two cents, 

     

    What David Gerdes said applies well in this case: for complex queries the output format wouldn't make much of a difference.

    And internally easyaccess is doing exactly that: query_to_pandas, then writing to csv, fits, or hdf5.  So query_and_save performs those same steps, with the advantage of writing multiple files so it doesn't run out of memory.  From my experience hdf5 is definitely faster, with a high compression ratio, although it is a less standard data format.  csv seems to be faster in some cases, although fits provides compression, which can be good for handling big files later on.

    There are, however, two key configuration parameters that might help, beyond the smart things you can do in the query itself and the choice of data format.

    (To see all parameters, run DESDB ~> config all show.)

    These can be changed using: DESDB ~> config <parameter> set <value>

    1) prefetch: this is the number of rows Oracle provides on each trip. It's very important and can affect the time by a good amount. If it's too small, you'll make many trips to the DB, and for each trip you'll need to process the data on your side (to put it in the right format); you want to make as few trips as possible. If it's too large, you might run into memory issues from bringing back too much data at a time. So, in general:
    few columns -> increase prefetch (say, 500,000 or even a million)
    many columns -> decrease it (a few thousand)
    the default is 10,000

    2) outfile_max_mb: the maximum size per file in MB; the default is 1000 (1 GB). Increasing this will reduce the number of files and the I/O overhead, which can improve performance in general (see the example just below).
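    For example, both parameters can be adjusted at the DESDB prompt (the values here are only illustrative; choose them based on your column count and available memory):

        DESDB ~> config all show
        DESDB ~> config prefetch set 50000
        DESDB ~> config outfile_max_mb set 2000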

    I hope this helps,

    1. Douglas Tucker

      Thanks, Matias Carrasco Kind!

      I will try working with different values of prefetch and outfile_max_mb.

      Thanks!
