# Author: Volker Hoffmann <volker@cheleb.net>, <volker@physik.uzh.ch>
# Date: 13 January 2015
%matplotlib inline
import matplotlib as mpl; mpl.rcParams['savefig.dpi'] = 144
import matplotlib.pyplot as plt
import kepler_helpers as kh
import constants as C
import pandas as pd
import numpy as np
In this notebook, we'll explore how to load Genga data into Pandas dataframes. First, we load the full Genga coordinate output (all columns). After that, we load only a subset of columns from a single output and extend the dataframe with the Keplerian elements. Finally, we demonstrate how to load the collisions file. Along the way, we also make some plots to show how to plot from dataframes. If you're not familiar with Pandas or dataframes, you should at least glance at the official introduction now.
http://pandas.pydata.org/pandas-docs/stable/10min.html
Should you decide to tinker around with preprocessing steps that generate intermediate outputs, please first read the "Notes on Performance" at the end of this notebook.
Helper modules (kepler_helpers, constants) are available at https://github.com/vhffm/G3.
Looking through the Genga documentation, we find that coordinate outputs are CSV (really, space-separated) files containing the following columns.
# https://bitbucket.org/sigrimm/genga/src/tip/Documentation.md#markdown-header-the-coordinate-output-files
# t i m r x y z vx vy vz Sx Sy Sz amin amax emin emax aecount aecountT enccountT test
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
We can use the pandas.read_csv() function to load CSV files into dataframes. The function takes a myriad of arguments we won't discuss here; the only relevant ones are those we use to define our data format. You can find the full argument list in the documentation.
http://pandas.pydata.org/pandas-docs/stable/generated/pandas.io.parsers.read_csv.html
By default, imported data is cast to strings or floats. To make sure integer columns are read as such, we pass the type for these columns manually.
# Define data source, format
genga_file = "/zbox/data/volker/Impacts/Solar2/01/Out_impacts_solar2_01_000100000000.dat"
names_cols = [ "time", "pid", "mass", "radius", "x", "y", "z", "vx", "vy", "vz", "Sx", "Sy", "Sz", \
"amin", "amax", "emin", "emax", "aecount", "aecountT", "enccount", "test", "X" ]
types_cols = { "pid": np.int32, "enccount": np.int32 }
# Load data
df = pd.read_csv(genga_file, sep=" ", \
header=None, names=names_cols, dtype=types_cols)
# Output first 5 particles. Transpose the table because it's big.
df.head(5).T
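As a quick sanity check (an optional step, not part of the pipeline above), we can verify that the manual type assignment took effect by inspecting the column types.
# Verify that pid and enccount were parsed as integers;
# all other columns should have been cast to float64 (or object)
df[["pid", "enccount"]].dtypes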
The final column (seen as a row here because we have transposed the table), "X", is a quirk of the Genga output format, which puts a space before the newline at the end of each line. Since space is our separator, Pandas parses a zero-width cell there. Passing a lineterminator of the form " \n" does not help, because Pandas only supports single-character line terminators; we therefore absorb the extra cell into a dummy column and ignore it.
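If the dummy column bothers you, one option (a sketch; the cells below simply ignore it) is to drop it right after loading.
# Drop the spurious "X" column produced by the trailing space
df = df.drop("X", axis=1)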
In practice, we rarely care about all of the output data. Suppose we want to focus on the masses, coordinates, and velocities of the particles. We can select a subset of columns as follows.
# https://bitbucket.org/sigrimm/genga/src/tip/Documentation.md#markdown-header-the-coordinate-output-files
# t i m r x y z vx vy vz Sx Sy Sz amin amax emin emax aecount aecountT enccountT test
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# Define data source, format
genga_file = "/zbox/data/volker/Impacts/Solar2/01/Out_impacts_solar2_01_000100000000.dat"
names_cols = [ "time", "pid", "mass", "radius", "x", "y", "z", "vx", "vy", "vz", "Sx", "Sy", "Sz", \
"amin", "amax", "emin", "emax", "aecount", "aecountT", "enccount", "test", "X" ]
touse_cols = [ 0, 1, 2, 4, 5, 6, 7, 8, 9 ]
types_cols = { "pid": np.int32 }
# Load data
df = pd.read_csv(genga_file, sep=" ", \
header=None, names=names_cols, dtype=types_cols, \
usecols=touse_cols)
# Output first 15 particles
df.head(15)
Instead of some sequentially increasing number, we can use the (hopefully unique) particle ID (PID) as our index.
We also convert the particle mass from the unit used in the coordinate file (Solar Masses) to Earth Masses in the dataframe.
# https://bitbucket.org/sigrimm/genga/src/tip/Documentation.md#markdown-header-the-coordinate-output-files
# t i m r x y z vx vy vz Sx Sy Sz amin amax emin emax aecount aecountT enccountT test
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# Define data source, format
genga_file = "/zbox/data/volker/Impacts/Solar2/01/Out_impacts_solar2_01_000100000000.dat"
names_cols = [ "time", "pid", "mass", "radius", "x", "y", "z", "vx", "vy", "vz", "Sx", "Sy", "Sz", \
"amin", "amax", "emin", "emax", "aecount", "aecountT", "enccount", "test", "X" ]
touse_cols = [ 0, 1, 2, 4, 5, 6, 7, 8, 9 ]
types_cols = { "pid": np.int32 }
# Load data
df = pd.read_csv(genga_file, sep=" ", \
header=None, names=names_cols, dtype=types_cols, \
usecols=touse_cols, \
index_col=1)
# Convert mass
df.mass *= C.msun/C.mearth
# Output first 15 particles
df.head(15)
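With the PID as index, individual particles can now be looked up directly. For example (assuming a particle with PID 42 exists in this output):
# Look up a single particle by its PID (PID 42 is an arbitrary example)
df.loc[42]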
Next, we wish to compute the Keplerian elements and append them to the dataframe. We use barycentric coordinates for this.
# https://bitbucket.org/sigrimm/genga/src/tip/Documentation.md#markdown-header-the-coordinate-output-files
# t i m r x y z vx vy vz Sx Sy Sz amin amax emin emax aecount aecountT enccountT test
# 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
# Define data source, format
genga_file = "/zbox/data/volker/Impacts/Solar2/01/Out_impacts_solar2_01_000100000000.dat"
names_cols = [ "time", "pid", "mass", "radius", "x", "y", "z", "vx", "vy", "vz", "Sx", "Sy", "Sz", \
"amin", "amax", "emin", "emax", "aecount", "aecountT", "enccount", "test", "X" ]
touse_cols = [ 0, 1, 2, 4, 5, 6, 7, 8, 9 ]
types_cols = { "pid": np.int32 }
# Load data
df = pd.read_csv(genga_file, sep=" ", \
header=None, names=names_cols, dtype=types_cols, \
usecols=touse_cols, \
index_col=1)
# We use np.asarray() to extract numpy arrays from the dataframe for further processing
x = np.asarray(df.x); y = np.asarray(df.y); z = np.asarray(df.z)
vx = np.asarray(df.vx); vy = np.asarray(df.vy); vz = np.asarray(df.vz)
m = np.asarray(df.mass)
# Now we convert to barycentric coordinates
x, vx = kh.helio2bary(x, vx, m)
y, vy = kh.helio2bary(y, vy, m)
z, vz = kh.helio2bary(z, vz, m)
# Compute Keplerian elements, and append the columns to the dataframe
df["a"], df["e"], df["i"], df["Omega"], df["omega"], df["M"] = kh.cart2kepX(x, y, z, vx, vy, vz, m)
# Finally, convert mass again
df.mass *= C.msun/C.mearth
# Output Keplerian elements for the first 15 particles
df[["mass", "a", "e", "i", "Omega", "omega", "M"]].head(15)
Next, we make a hexagonally binned density plot for all particles that have zero mass (these are test particles) and overlay the Kepler ellipses for the massive particles. Since this is an IPython notebook, we still have access to the previously loaded dataframe, so we do not need to load anything at this point.
# Compute Kepler Ellipses
a0 = np.asarray(df[df.mass>0].a)
e0 = np.asarray(df[df.mass>0].e)
i0 = np.asarray(df[df.mass>0].i)
Omega0 = np.asarray(df[df.mass>0].Omega)
omega0 = np.asarray(df[df.mass>0].omega)
xell, yell, _ = kh.compute_ellipseX(a0, e0, i0, Omega0, omega0)
fig, ax = plt.subplots(1,1)
ax.set_aspect("equal")
ax.hexbin(df[df.mass==0].x, df[df.mass==0].y, cmap=mpl.cm.bone_r, \
mincnt=1, bins="log", gridsize=128, extent=[-60, 60, -60, 60], \
edgecolors="grey", linewidths=0.0, alpha=1.0)
ax.plot(xell.T, yell.T)
ax.set_title("Solar2")
ax.set_xlabel("X (AU)")
ax.set_ylabel("Y (AU)")
pass
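Since the Keplerian elements live in the same dataframe, other views come cheap. As a variation (a sketch; we assume the semi-major axis is in AU, consistent with the coordinates), here is an (a, e) scatter plot of the test particles.
# Eccentricity vs. semi-major axis for the test particles
fig, ax = plt.subplots(1,1)
ax.scatter(df[df.mass==0].a, df[df.mass==0].e, s=1, alpha=0.3)
ax.set_xlabel("a (AU)")
ax.set_ylabel("e")
pass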
Finally, we demonstrate how to load data from the collision files, and how to apply some light filtering.
# Define data source, format
genga_file = "/zbox/data/volker/Impacts/Solar2/01/Collisions_impacts_solar2_01.dat"
names_cols = [ "time", \
"indexi", "mi", "ri", "xi", "yi", "zi", "vxi", "vyi", "vzi", "Sxi", "Syi", "Szi",\
"indexj", "mj", "rj", "xj", "yj", "zj", "vxj", "vyj", "vzj", "Sxj", "Syj", "Szj", "X" ]
touse_cols = [ 0, 1, 2, 4, 5, 6, 13, 14, 16, 17, 18 ]
types_cols = { "indexi": np.int32, "indexj": np.int32 }
# Load data
df = pd.read_csv(genga_file, sep=" ", \
header=None, names=names_cols, dtype=types_cols, \
usecols=touse_cols)
# Fix Masses
df.mi = df.mi * C.msun/C.mearth
df.mj = df.mj * C.msun/C.mearth
# Display first 15 collisions involving particle 4
df[(df.indexi==4) | (df.indexj==4)].head(15)
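The collision dataframe also lends itself to quick aggregate views. As an illustrative sketch (not part of the original notebook; time is left in whatever unit Genga writes), here is a histogram of collision times.
# Distribution of collision times across the whole run
fig, ax = plt.subplots(1,1)
ax.hist(df.time, bins=64)
ax.set_xlabel("Time")  # units as written in the Genga output
ax.set_ylabel("Number of Collisions")
pass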
Earlier implementations of the Genga post-processing pipeline precomputed orbital elements as well as Keplerian ellipses, and stored them in (first) Numpy archives (.npz) or (later) HDF5 files. At the time, the routines to compute Kepler elements as well as ellipses consisted of excruciatingly slow for-loops.
The current implementation of the processing routines is much faster, allowing us to bypass the tedious (and storage-consuming) precomputation of orbital elements and Keplerian ellipses. In fact, it turns out that loading an HDF5 file with precomputed Kepler elements (100'000 particles) can be slower than using Pandas to read the CSV file and computing the Kepler elements on the fly.
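Should you want to check this trade-off yourself, IPython's %time magic gives a quick, if rough, comparison. The HDF5 path and key below are hypothetical placeholders; the CSV arguments are the ones defined in the coordinate-loading cells above.
# Rough timing comparison. The HDF5 file and key are placeholders only;
# re-run one of the coordinate-loading cells first to define the CSV arguments.
%time df_csv = pd.read_csv(genga_file, sep=" ", header=None, names=names_cols)
%time df_hdf = pd.read_hdf("/path/to/precomputed_elements.hdf5", "elements")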