In [1]:
# Author: Volker Hoffmann <volker@cheleb.net>, <volker@physik.uzh.ch>
# Date: 13 January 2015
In [2]:
%matplotlib inline
In [3]:
import matplotlib as mpl; mpl.rcParams['savefig.dpi'] = 144
import matplotlib.pyplot as plt
import kepler_helpers as kh
import constants as C
import pandas as pd
import numpy as np

Genga & Pandas

In this notebook, we'll explore how to load Genga data into Pandas dataframes. First, we load the full Genga coordinate output (all columns). After that, we load only a subset of columns from a single output and extend the dataframe with the Keplerian elements. Finally, we demo how to load the collisions file. Along the way, we also make some plots to show how to plot from dataframes. If you're not familiar with Pandas or dataframes, you should at least have a glance at the official introduction now.

http://pandas.pydata.org/pandas-docs/stable/10min.html

Should you decide to tinker around with preprocessing steps that generate intermediate outputs, please first read the "Notes on Performance" at the end of this notebook.

Helper modules (kepler_helpers, constants) are available at https://github.com/vhffm/G3.

Load Full Coordinate Output

Looking through the Genga documentation, we find that coordinate outputs are CSV (really, space-separated) files containing the following columns.

In [4]:
# https://bitbucket.org/sigrimm/genga/src/tip/Documentation.md#markdown-header-the-coordinate-output-files
# t i m r x y z vx vy vz Sx Sy Sz amin amax emin emax aecount aecountT enccountT test
# 0 1 2 3 4 5 6  7  8  9 10 11 12   13   14   15   16      17       18        19   20

We can use the pandas.read_csv() function to load CSV files into dataframes. The function takes a myriad of arguments we won't discuss here. The only relevant arguments are those that we use to define our data format. You can find the full argument list in the documentation.

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html

By default, imported columns are parsed as strings or floats. To make sure integer columns are read as such, we pass the types for these columns manually.

In [5]:
# Define data source, format
genga_file = "/zbox/data/volker/Impacts/Solar2/01/Out_impacts_solar2_01_000100000000.dat"
names_cols = [ "time", "pid", "mass", "radius", "x", "y", "z", "vx", "vy", "vz", "Sx", "Sy", "Sz", \
               "amin", "amax", "emin", "emax", "aecount", "aecountT", "enccount", "test", "X" ]
types_cols = { "pid": np.int32, "enccount": np.int32 }

# Load data
df = pd.read_csv("%s" % genga_file, sep=" ", \
                 header=None, names=names_cols, dtype=types_cols)

# Output first 5 particles. Transpose the table because it's big.
df.head(5).T
Out[5]:
0 1 2 3 4
time 1642710.472279 1642710.472279 1642710.472279 1642710.472279 1642710.472279
pid 0.000000 1.000000 2.000000 3.000000 4.000000
mass 0.000000 0.000002 0.000003 0.000000 0.000954
radius 0.000016 0.000040 0.000043 0.000023 0.000467
x -0.387164 -0.582620 0.183459 -1.352997 -2.625899
y -0.057947 0.411237 -0.951225 -0.375876 4.356407
z 0.022029 0.028701 -0.008535 0.178163 0.002739
vx 0.395149 -0.674428 1.013811 0.238575 -0.372430
vy -1.525691 -0.982117 0.190160 -0.836274 -0.249548
vz -0.183160 0.017311 -0.020417 0.017911 0.013392
Sx 0.000000 0.000000 0.000000 0.000000 0.000000
Sy 0.000000 0.000000 0.000000 0.000000 0.000000
Sz 0.000000 0.000000 0.000000 0.000000 0.000000
amin 0.000000 0.000000 0.000000 0.000000 0.000000
amax 10000.000000 10000.000000 10000.000000 10000.000000 10000.000000
emin 0.000000 0.000000 0.000000 0.000000 0.000000
emax 1.000000 1.000000 1.000000 1.000000 1.000000
aecount 1.000000 1.000000 1.000000 1.000000 1.000000
aecountT 1.000000 1.000000 1.000000 1.000000 1.000000
enccount 0.000000 0.000000 0.000000 0.000000 0.000000
test -0.000000 -0.000002 -0.000002 -0.000000 -0.000092
X NaN NaN NaN NaN NaN

The final column "X" (seen as a row here because we have transposed the table) is a quirk of the Genga output format, which puts a space before the line break at the end of each line. Since space is our separator, Pandas tries to load a zero-width cell. Passing a lineterminator of the form " \n" to the read_csv() function could be a workaround, except that we cannot pass a line terminator string longer than one character to the function.
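A simpler way to deal with the spurious column is to drop it after loading. A minimal sketch on toy data (the two-line CSV below is made up, not actual Genga output — only the trailing space before the newline matters):

```python
import io
import pandas as pd

# Toy data reproducing the quirk: each line ends with a space
# before the newline, just like the Genga output.
csv_data = "0.5 0 1.0 \n0.5 1 2.0 \n"

# The trailing space makes the parser see one extra, empty column ("X").
names_cols = ["time", "pid", "mass", "X"]
df = pd.read_csv(io.StringIO(csv_data), sep=" ", header=None, names=names_cols)

# The spurious column is all-NaN, so we can simply drop it after loading.
df = df.drop("X", axis=1)
```

Alternatively, leaving "X" out of the usecols selection (as we do below) sidesteps the problem entirely.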

Load Column Subset

Now we don't particularly care about all the output data. Suppose we want to focus on the masses, coordinates, and velocities of particles. We can do so by selecting a subset as follows.

In [6]:
# https://bitbucket.org/sigrimm/genga/src/tip/Documentation.md#markdown-header-the-coordinate-output-files
# t i m r x y z vx vy vz Sx Sy Sz amin amax emin emax aecount aecountT enccountT test
# 0 1 2 3 4 5 6  7  8  9 10 11 12   13   14   15   16      17       18        19   20

# Define data source, format
genga_file = "/zbox/data/volker/Impacts/Solar2/01/Out_impacts_solar2_01_000100000000.dat"
names_cols = [ "time", "pid", "mass", "radius", "x", "y", "z", "vx", "vy", "vz", "Sx", "Sy", "Sz", \
               "amin", "amax", "emin", "emax", "aecount", "aecountT", "enccount", "test", "X" ]
touse_cols = [ 0, 1, 2, 4, 5, 6, 7, 8, 9 ]
types_cols = { "pid": np.int32 }

# Load data
df = pd.read_csv("%s" % genga_file, sep=" ", \
                 header=None, names=names_cols, dtype=types_cols, \
                 usecols=touse_cols)

# Output first 15 particles
df.head(15)
Out[6]:
time pid mass x y z vx vy vz
0 1642710.472279 0 1.651501e-07 -0.387164 -0.057947 0.022029 0.395149 -1.525691 -0.183160
1 1642710.472279 1 2.446835e-06 -0.582620 0.411237 0.028701 -0.674428 -0.982117 0.017311
2 1642710.472279 2 3.002363e-06 0.183459 -0.951225 -0.008535 1.013811 0.190160 -0.020417
3 1642710.472279 3 3.212508e-07 -1.352997 -0.375876 0.178163 0.238575 -0.836274 0.017911
4 1642710.472279 4 9.542004e-04 -2.625899 4.356407 0.002739 -0.372430 -0.249548 0.013392
5 1642710.472279 5 2.857071e-04 -7.436435 5.449103 0.253934 -0.206476 -0.264500 0.004105
6 1642710.472279 6 4.364285e-05 -18.427686 6.898206 0.686172 -0.086148 -0.205813 0.002101
7 1642710.472279 7 5.148057e-05 -5.919358 -29.500837 0.508090 0.178532 -0.034932 -0.002023
8 1642710.472279 8 6.580866e-09 -17.343162 -42.986987 -9.428381 0.110970 -0.064543 0.016620
9 1642710.472279 149700 0.000000e+00 -21.477568 -28.095809 4.578356 0.090324 -0.108176 -0.008064
10 1642710.472279 249384 0.000000e+00 -14.131015 -22.563444 0.369514 0.159730 -0.108941 -0.005806
11 1642710.472279 387916 0.000000e+00 2.137355 -39.007327 -2.393485 0.161909 0.010575 -0.000573
12 1642710.472279 321754 0.000000e+00 24.360460 -24.403475 -0.504773 0.117468 0.121468 -0.008416
13 1642710.472279 180150 0.000000e+00 -12.018206 5.223603 3.006129 -0.173060 -0.255636 0.010850
14 1642710.472279 272902 0.000000e+00 21.407919 12.954776 -1.585356 -0.025430 0.209020 0.006014

Use Particle ID as Index, Convert Mass

Instead of some sequentially increasing number, we can use the (hopefully unique) particle id (PID) as our index.

We also convert the particle mass from the unit used in the coordinate file (Solar Masses) to Earth Masses in the dataframe.

In [7]:
# https://bitbucket.org/sigrimm/genga/src/tip/Documentation.md#markdown-header-the-coordinate-output-files
# t i m r x y z vx vy vz Sx Sy Sz amin amax emin emax aecount aecountT enccountT test
# 0 1 2 3 4 5 6  7  8  9 10 11 12   13   14   15   16      17       18        19   20

# Define data source, format
genga_file = "/zbox/data/volker/Impacts/Solar2/01/Out_impacts_solar2_01_000100000000.dat"
names_cols = [ "time", "pid", "mass", "radius", "x", "y", "z", "vx", "vy", "vz", "Sx", "Sy", "Sz", \
               "amin", "amax", "emin", "emax", "aecount", "aecountT", "enccount", "test", "X" ]
touse_cols = [ 0, 1, 2, 4, 5, 6, 7, 8, 9 ]
types_cols = { "pid": np.int32 }

# Load data
df = pd.read_csv("%s" % genga_file, sep=" ", \
                 header=None, names=names_cols, dtype=types_cols, \
                 usecols=touse_cols, \
                 index_col=1)

# Convert mass
df.mass *= C.msun/C.mearth

# Output first 15 particles
df.head(15)
Out[7]:
time mass x y z vx vy vz
pid
0 1642710.472279 0.054987 -0.387164 -0.057947 0.022029 0.395149 -1.525691 -0.183160
1 1642710.472279 0.814674 -0.582620 0.411237 0.028701 -0.674428 -0.982117 0.017311
2 1642710.472279 0.999636 0.183459 -0.951225 -0.008535 1.013811 0.190160 -0.020417
3 1642710.472279 0.106960 -1.352997 -0.375876 0.178163 0.238575 -0.836274 0.017911
4 1642710.472279 317.700915 -2.625899 4.356407 0.002739 -0.372430 -0.249548 0.013392
5 1642710.472279 95.126149 -7.436435 5.449103 0.253934 -0.206476 -0.264500 0.004105
6 1642710.472279 14.530883 -18.427686 6.898206 0.686172 -0.086148 -0.205813 0.002101
7 1642710.472279 17.140450 -5.919358 -29.500837 0.508090 0.178532 -0.034932 -0.002023
8 1642710.472279 0.002191 -17.343162 -42.986987 -9.428381 0.110970 -0.064543 0.016620
149700 1642710.472279 0.000000 -21.477568 -28.095809 4.578356 0.090324 -0.108176 -0.008064
249384 1642710.472279 0.000000 -14.131015 -22.563444 0.369514 0.159730 -0.108941 -0.005806
387916 1642710.472279 0.000000 2.137355 -39.007327 -2.393485 0.161909 0.010575 -0.000573
321754 1642710.472279 0.000000 24.360460 -24.403475 -0.504773 0.117468 0.121468 -0.008416
180150 1642710.472279 0.000000 -12.018206 5.223603 3.006129 -0.173060 -0.255636 0.010850
272902 1642710.472279 0.000000 21.407919 12.954776 -1.585356 -0.025430 0.209020 0.006014
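With the PID as index, individual particles can now be looked up directly by id via .loc. A minimal sketch on a toy dataframe (the values below are made up, not taken from the output above):

```python
import pandas as pd

# Toy coordinate table indexed by particle id (hypothetical values).
df = pd.DataFrame({"mass": [0.05, 317.70], "x": [-0.39, -2.63]},
                  index=pd.Index([0, 4], name="pid"))

# .loc selects rows by index label, i.e. by particle id.
particle = df.loc[4]
```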

Compute, Append Kepler Elements to Dataframe

Finally, we wish to compute the Keplerian elements and append them to the dataframe. We use barycentric coordinates for this.

In [8]:
# https://bitbucket.org/sigrimm/genga/src/tip/Documentation.md#markdown-header-the-coordinate-output-files
# t i m r x y z vx vy vz Sx Sy Sz amin amax emin emax aecount aecountT enccountT test
# 0 1 2 3 4 5 6  7  8  9 10 11 12   13   14   15   16      17       18        19   20

# Define data source, format
genga_file = "/zbox/data/volker/Impacts/Solar2/01/Out_impacts_solar2_01_000100000000.dat"
names_cols = [ "time", "pid", "mass", "radius", "x", "y", "z", "vx", "vy", "vz", "Sx", "Sy", "Sz", \
               "amin", "amax", "emin", "emax", "aecount", "aecountT", "enccount", "test", "X" ]
touse_cols = [ 0, 1, 2, 4, 5, 6, 7, 8, 9 ]
types_cols = { "pid": np.int32 }

# Load data
df = pd.read_csv("%s" % genga_file, sep=" ", \
                 header=None, names=names_cols, dtype=types_cols, \
                 usecols=touse_cols, \
                 index_col=1)

# We use np.asarray() to extract numpy arrays from the dataframe for further processing
x = np.asarray(df.x); y = np.asarray(df.y); z = np.asarray(df.z)
vx = np.asarray(df.vx); vy = np.asarray(df.vy); vz = np.asarray(df.vz)
m = np.asarray(df.mass)

# Now we convert to barycentric coordinates
x, vx = kh.helio2bary(x, vx, m)
y, vy = kh.helio2bary(y, vy, m)
z, vz = kh.helio2bary(z, vz, m)

# Compute Keplerian elements, and append the columns to the dataframe 
df["a"], df["e"], df["i"], df["Omega"], df["omega"], df["M"] = kh.cart2kepX(x, y, z, vx, vy, vz, m)

# Finally, convert mass again
df.mass *= C.msun/C.mearth

# Output Keplerian elements for the first 15 particles
df[["mass", "a", "e", "i", "Omega", "omega", "M"]].head(15)
Out[8]:
mass a e i Omega omega M
pid
0 0.054987 0.377428 0.100442 0.124449 0.632254 4.593984 4.550777
1 0.814674 0.707926 0.012086 0.043150 1.312271 2.623852 4.898652
2 0.999636 1.012737 0.039237 0.021623 1.342315 3.299174 0.246332
3 0.106960 1.512882 0.067213 0.128188 2.009638 1.419053 6.273927
4 317.700915 5.175568 0.051665 0.029932 2.095590 1.270419 5.126872
5 95.126149 9.542789 0.045845 0.029803 1.330314 0.422877 0.694843
6 14.530883 19.216074 0.044644 0.035787 1.440379 5.445115 2.106777
7 17.140450 30.089624 0.006962 0.020179 2.363647 3.760937 4.687009
8 0.002191 39.254997 0.231511 0.252672 5.236245 2.602867 2.551499
149700 0.000000 27.657564 0.351048 0.155012 1.894643 5.725117 2.340797
249384 0.000000 26.568763 0.036359 0.033591 1.437163 1.048883 1.594326
387916 0.000000 40.604173 0.037892 0.061341 0.125869 4.974041 5.974763
321754 0.000000 34.264629 0.017744 0.052049 2.070486 5.397729 4.345121
180150 0.000000 18.613557 0.330923 0.225711 1.129905 0.813398 0.393604
272902 0.000000 28.351173 0.422840 0.086975 1.357282 3.727204 0.887263

Hexbin Plot, Orbits

Finally, we make a hexagonally binned density plot for all particles that have zero mass (these are test particles) and overlay the Kepler ellipses for the massive particles. Since this is an IPython notebook, we still have access to the previously loaded dataframe, so we do not need to load anything at this point.

In [9]:
# Compute Kepler Ellipses
a0 = np.asarray(df[df.mass>0].a)
e0 = np.asarray(df[df.mass>0].e)
i0 = np.asarray(df[df.mass>0].i)
Omega0 = np.asarray(df[df.mass>0].Omega)
omega0 = np.asarray(df[df.mass>0].omega)
xell, yell, _ = kh.compute_ellipseX(a0, e0, i0, Omega0, omega0)
In [10]:
fig, ax = plt.subplots(1,1)
ax.set_aspect("equal")
ax.hexbin(df[df.mass==0].x, df[df.mass==0].y, cmap=mpl.cm.bone_r, \
          mincnt=1, bins="log", gridsize=128, extent=[-60, 60, -60, 60], \
          edgecolors="grey", linewidths=0.0, alpha=1.0)
ax.plot(xell.T, yell.T)
ax.set_title("Solar2")
ax.set_xlabel("X (AU)")
ax.set_ylabel("Y (AU)")
pass

Collision Files

Finally, we demo how to load the data from the collision files, along with some light filtering.

In [11]:
# Define data source, format
genga_file = "/zbox/data/volker/Impacts/Solar2/01/Collisions_impacts_solar2_01.dat"
names_cols = [ "time", \
               "indexi", "mi", "ri", "xi", "yi", "zi", "vxi", "vyi", "vzi", "Sxi", "Syi", "Szi",\
               "indexj", "mj", "rj", "xj", "yj", "zj", "vxj", "vyj", "vzj", "Sxj", "Syj", "Szj", "X" ]
touse_cols = [ 0, 1, 2, 4, 5, 6, 13, 14, 16, 17, 18 ]
types_cols = { "indexi": np.int32, "indexj": np.int32 }

# Load data
df = pd.read_csv("%s" % genga_file, sep=" ", \
                 header=None, names=names_cols, dtype=types_cols, \
                 usecols=touse_cols)

# Fix Masses
df.mi = df.mi * C.msun/C.mearth
df.mj = df.mj * C.msun/C.mearth

# Display first 15 collisions involving particle 4
df[np.logical_or(df.indexi==4, df.indexj==4)][:15]
Out[11]:
time indexi mi xi yi zi indexj mj xj yj zj
0 2.91165 4 317.700784 -5.402800 -0.73829 0.000509 28234 0 -5.402950 -0.738593 0.124217
3 3.52892 4 317.700784 -4.947020 -2.28634 -0.002431 39995 0 -4.947380 -2.286090 0.120143
4 4.29536 4 317.700784 -3.753530 -3.89958 -0.005776 40395 0 -3.753880 -3.899680 0.099946
5 10.80380 4 317.700784 1.620490 4.79073 0.008745 30378 0 1.620860 4.790690 -0.056029
6 13.14330 4 317.700784 -4.226230 3.27953 0.007487 33226 0 -4.226490 3.279170 0.080977
7 19.15590 4 317.700784 3.617830 -3.53990 -0.008143 33233 0 3.618150 -3.540160 -0.066286
8 22.54980 4 317.700784 1.919780 4.66495 0.008410 37261 0 1.920070 4.664850 -0.062186
9 22.68300 4 317.700784 1.570760 4.80918 0.008791 37182 0 1.570480 4.809470 -0.054982
11 33.25050 4 317.700784 4.321050 2.42982 0.003253 23713 0 4.321260 2.430080 -0.106841
12 34.02980 4 317.700784 2.859310 4.10692 0.007047 23687 0 2.859730 4.106800 -0.081198
13 34.81070 4 317.700784 0.853096 5.01418 0.009409 36578 0 0.853058 5.013890 -0.040375
14 39.00210 4 317.700784 -5.053290 -2.04071 -0.001928 40581 0 -5.053220 -2.040980 0.121290
15 39.96440 4 317.700784 -3.566570 -4.06407 -0.006120 40785 0 -3.566550 -4.063630 0.096919
16 45.06780 4 317.700784 4.375750 2.32287 0.003016 40001 0 4.375480 2.323130 -0.107824
17 46.52600 4 317.700784 1.242730 4.91422 0.009092 36381 0 1.242410 4.913920 -0.048537
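The np.logical_or() call above can also be written with pandas' element-wise | operator, which avoids the detour through NumPy. A minimal sketch on a toy collision table (values made up):

```python
import pandas as pd

# Toy collision table (hypothetical values).
df = pd.DataFrame({"indexi": [4, 7, 9], "indexj": [3, 4, 5]})

# Element-wise OR; the parentheses are required because | binds
# more tightly than ==.
hits = df[(df.indexi == 4) | (df.indexj == 4)]
```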

Notes on Performance

Earlier implementations of the Genga post-processing pipeline precomputed orbital elements as well as Keplerian ellipses, and stored them in (first) Numpy archives (.npz) or (later) HDF5 files. At the time, the routines to compute Kepler elements as well as ellipses consisted of excruciatingly slow for-loops.

The current implementation of processing routines is much faster, allowing us to bypass tedious (and storage space consuming) precomputing of orbital elements and Keplerian ellipses. In fact, it turns out that loading an HDF5 file (100'000 particles) with precomputed Kepler elements can be slower than using Pandas to read the CSV file and computing the Kepler element on-the-fly.