C# Project – Scraping Files for Data

I was receiving literally thousands of emails in Outlook about a specific network service error, many of them duplicates. The volume quickly outgrew my ability to copy, paste and process them by hand, so I got fed up and looked for a more creative way to handle it.

At first I thought about writing an Outlook macro to scrape the emails and save the information they contained, but macros are a security hazard and were locked down by network policy anyway. Instead, I opted to make a utility in C#.

Writing a utility against the Office COM objects, namely Outlook, did nothing but try my patience. Then I thought: why not have the emails saved to a folder as text files as they come in? After I explained my intentions to the network admin, he set up a script on the mail server to save the specific emails to a share as text files. Excellent.

With the emails saved as plain text files, I could write a text scraper to pull out the data I needed and process it. However, that data was sandwiched inside an error message with no separators around it (ex: errorfound:D2A1AB9C83=|0X0C0MID:3:3), so I had to process the files by searching for a start string and grabbing the text between it and an end string. What I needed sat between the colon and the pipe character, and the text surrounding those never changed. Below are the parser function and the button handler, where I also added code to save the list to a file.

//String parser: returns the text between Start and End, or "" when there is no match
public string ParseBetween(string Subject, string Start, string End)
{
    //Escape the markers so characters such as | are matched literally
    string start = Regex.Escape(Start);
    string end = Regex.Escape(End);
    string pattern = start + @"\s*(((?!" + start + "|" + end + @").)+)\s*" + end;
    //Group 1 holds only the text between the markers, whatever their casing in the file
    return Regex.Match(Subject, pattern, RegexOptions.IgnoreCase).Groups[1].Value;
}

//Parse between two strings and grab the contents as a new string
private void button1_Click(object sender, EventArgs e)
{
    textBox1.Clear();
    tsNotify.Text = "";
    string s2 = "errorfound:";      //beginning string
    string s3 = "|0X0C0MID:3:3";    //end string
    //files to parse
    foreach (string file in Directory.EnumerateFiles(@"\\server\SomeErrors\", "*.txt"))
    {
        string contents = File.ReadAllText(file);
        string strParsed = ParseBetween(contents, s2, s3);
        //Clean up the string: keep only letters, digits and spaces
        string clean = Regex.Replace(strParsed, "[^A-Za-z0-9 ]", "");
        textBox1.AppendText(clean + "\r\n");
    }
    //Save the list; disposing the writer at the end of the using block flushes it
    using (StreamWriter objWriter = new StreamWriter(@"C:\ServerErrors.txt"))
    {
        objWriter.Write(textBox1.Text);
    }
    tsNotify.Text = "Parsed and saved to C:\\ServerErrors.txt";
}
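
For fixed markers like these, the same extraction can also be done without a regular expression. The following is only a minimal sketch, not the code the tool uses: the class name Marker and the method Between are made up for illustration, and it assumes each file contains at most one error of interest. It returns null when a marker is missing, so unrelated emails can be skipped instead of ending up as blank entries.

```csharp
using System;

public static class Marker
{
    // Hypothetical helper: returns the text between the first 'start' marker and the
    // next 'end' marker, or null when either marker is missing.
    public static string Between(string text, string start, string end)
    {
        int from = text.IndexOf(start, StringComparison.OrdinalIgnoreCase);
        if (from < 0) return null;
        from += start.Length;                       // skip past the start marker

        int to = text.IndexOf(end, from, StringComparison.OrdinalIgnoreCase);
        if (to < 0) return null;

        return text.Substring(from, to - from).Trim();
    }
}
```

With the sample error from above, Marker.Between("errorfound:D2A1AB9C83=|0X0C0MID:3:3", "errorfound:", "|0X0C0MID:3:3") returns "D2A1AB9C83=", which the clean-up step then reduces to D2A1AB9C83.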

As the data was pulled from the text files, I ran into some string anomalies such as blank lines (sometimes several in a row) and stray white space. Not all of the saved emails with that title actually related to the error, so those emails were showing up as blank entries in the list. Trimming the strings and dropping the empty lines took care of that.

private void button5_Click(object sender, EventArgs e)
{
    string filePath = "C:\\ServerErrors.txt";
    //remove any empty lines
    string[] lines = File.ReadAllLines(filePath).Where(s => s.Trim() != string.Empty).ToArray();
    listBox1.Items.AddRange(lines);
    tsNotify.Text = "Errors added for processing";
}

Once processed, I removed any duplicate data using the code below:

private void button6_Click(object sender, EventArgs e)
{
    string[] arr = new string[listBox1.Items.Count];
    listBox1.Items.CopyTo(arr, 0);
    //Clean first, then de-duplicate, so entries that differ only by stray characters collapse too
    var arr2 = arr.Select(s => Regex.Replace(s, "[^A-Za-z0-9 ]", "")).Distinct();
    listBox1.Items.Clear();
    foreach (string s in arr2)
    {
        listBox1.Items.Add(s);
    }
    tsNotify.Text = "Duplicates removed";
}
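
The clean-up, blank-line and duplicate steps can also be chained into one pass that does not depend on the form controls. This is a rough sketch under my own naming (ErrorList and Normalize are not part of the project); it reuses the same character filter as the handlers above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ErrorList
{
    // Hypothetical helper: strips unwanted characters, drops blank entries
    // and removes duplicates, keeping the first occurrence of each value.
    public static List<string> Normalize(IEnumerable<string> rawLines)
    {
        return rawLines
            .Select(line => Regex.Replace(line, "[^A-Za-z0-9 ]", "").Trim())
            .Where(line => line.Length > 0)
            .Distinct()
            .ToList();
    }
}
```

It could be fed straight from the saved file, for example ErrorList.Normalize(File.ReadAllLines(filePath)), and the result added to the list box in one call.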

Once finished, I output a final list of errors, minus the white space, blank lines and duplicates.

private void button7_Click(object sender, EventArgs e)
{
    foreach (object liItem in listBox1.Items)
        textBox2.Text += liItem.ToString() + "\r\n";
    tsNotify.Text = "Final list ready to copy";
}

After a few iterations I was successfully scraping and processing the data I needed. Since I only wanted to process newer text files on each run, I deleted the files on the server share at the end of every run.

private void button2_Click(object sender, EventArgs e)
{
    System.IO.DirectoryInfo di = new DirectoryInfo(@"\\server\SomeErrors\");
    foreach (FileInfo file in di.GetFiles())
    {
        file.Delete();
    }
    //Verbatim string so the UNC path shows with both leading backslashes
    tsNotify.Text = @"Remote files at \\server\SomeErrors deleted";
    textBox1.Text = "";
    listBox1.Items.Clear();
}
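
One thing to be aware of with deleting everything in the share: an email saved by the server script after the parse step, but before the delete, would be removed without ever being read. A possible variation, sketched below with made-up names (Cleanup, DeleteProcessed), is to delete only the files that were actually parsed and to report any that could not be removed. It assumes the caller kept the list of file paths from the parse loop.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

public static class Cleanup
{
    // Hypothetical helper: deletes the given files and returns those that could not be removed.
    public static List<string> DeleteProcessed(IEnumerable<string> processedFiles)
    {
        var failed = new List<string>();
        foreach (string path in processedFiles)
        {
            try
            {
                File.Delete(path);                 // does not throw if the file is already gone
            }
            catch (IOException)                    // file is still in use
            {
                failed.Add(path);
            }
            catch (UnauthorizedAccessException)    // read-only file or no permission
            {
                failed.Add(path);
            }
        }
        return failed;
    }
}
```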

As with all of my projects, this one came out of necessity and is not pretty code in any way. It's functional for my needs and serves its purpose, though. The code in my project can be freely found around the internet with simple Google searches or on Microsoft’s programming help sites. If the code benefits anyone, then awesome. I take credit for nothing more than the tool I created to accomplish a task.

cURL Project Application in C#

Out of necessity, I have been throwing together a few different tools in C# over time to cut down how long certain things take me. As I create them I keep the code generic, with nothing company-specific in it, and I like to share the projects so that others who are learning to code or searching for code might find them beneficial.

Testing in a QA/eCommerce environment, I use cURL pretty much daily to clear the Varnish cache when testing web pages across different staging servers, and I was curious about creating a C# wrapper for curl.exe that I could include in my standalone portable tool set. I did some research and found several great resources, but I chose to stick with a Microsoft article titled How To Write a Wrapper for a Command-Line Tool with Visual C# .NET. The article gives a great explanation of creating a class file and adding it to a project.

The cURL Project Application

My cURL project simply needs to issue a purge command at whatever URL I give it, so it's geared toward that specific task. Normally I would add the folder containing curl.exe to my system PATH environment variable, open a command prompt and issue curl -X PURGE v1.cms.servername.com, but the task, for me, is to include this within the “forms” tools that I created in C#.

Using CMD

Issuing the curl -X PURGE command in the command console yields the results below. I just need the same output in my tool, which I get with return "\t" + output in the class file Start.cs.

 

[Image: curl_cmd]

Using cURL Tool

The URL field is for the URL on the server whose Varnish cache I want to clear. The drop-down boxes hold the switch and the command I want to issue, respectively. The PURGE button simply runs the app and, on success, gives me a message dialog as shown in the second cURL graphic below. The drop-down boxes actually list the switch and the command three times each, because they are loaded from an array in the Form1_Load event and I may add a couple more items later. If I don't, I will break these out of the array but still load them in that event.

Code follows as well as a ZIP file to download that contains the project and executable. Enjoy.

[Image: mycurlapp]

 

[Image: curl_app01]

Start.cs file based on the Microsoft article

The class can really just get by with using System.Diagnostics and using System.IO. Modify as you see fit for your project.

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Diagnostics;
using System.IO;

namespace MyCurlApp
{
    public class start
    {
        internal static string Run(string exeName, string argsLine, int timeoutSeconds)
        {
            StreamReader outputStream = StreamReader.Null;
            string output = "";
            bool success = false;

            try
            {
                Process newProcess = new Process();
                newProcess.StartInfo.FileName = exeName;
                newProcess.StartInfo.Arguments = argsLine;
                newProcess.StartInfo.UseShellExecute = false;
                newProcess.StartInfo.CreateNoWindow = true;
                newProcess.StartInfo.RedirectStandardOutput = true;
                newProcess.Start();

                if (0 == timeoutSeconds)
                {
                    outputStream = newProcess.StandardOutput;
                    output = outputStream.ReadToEnd();
                    newProcess.WaitForExit();
                }
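                else
                {
                    // Caution: this waits before reading. A tool that writes more output than the
                    // redirected pipe can buffer will stall here until the timeout expires.
                    success = newProcess.WaitForExit(timeoutSeconds * 1000);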

                    if (success)
                    {
                        outputStream = newProcess.StandardOutput;
                        output = outputStream.ReadToEnd();
                    }
                    else
                    {
                        output = "Timed out at " + timeoutSeconds + " seconds waiting for " + exeName + " to exit.";
                    }
                }

            }
            catch (Exception e)
            {
                throw (new Exception("An error occurred running " + exeName + ".", e));
            }
            finally
            {
                outputStream.Close();
            }
            return "\t" + output;

        }
    }
}
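
If the wrapper is ever pointed at a tool that prints a lot of output, one common way to avoid a stall is to collect the output through the Process events while waiting. The following is a sketch of that approach, not the Microsoft article's code; the class name SafeRunner is an assumption of mine, and it kills the child process on timeout, which the original does not do.

```csharp
using System;
using System.ComponentModel;
using System.Diagnostics;
using System.Text;

public static class SafeRunner
{
    // Hypothetical variant of start.Run that reads output while the process is running.
    public static string Run(string exeName, string argsLine, int timeoutSeconds)
    {
        var output = new StringBuilder();
        using (var process = new Process())
        {
            process.StartInfo.FileName = exeName;
            process.StartInfo.Arguments = argsLine;
            process.StartInfo.UseShellExecute = false;
            process.StartInfo.CreateNoWindow = true;
            process.StartInfo.RedirectStandardOutput = true;

            // Collect each line as it arrives so the child never blocks on a full pipe.
            process.OutputDataReceived += (sender, e) =>
            {
                if (e.Data != null)
                {
                    lock (output) { output.AppendLine(e.Data); }
                }
            };

            process.Start();
            process.BeginOutputReadLine();

            if (timeoutSeconds == 0)
            {
                process.WaitForExit();
            }
            else if (process.WaitForExit(timeoutSeconds * 1000))
            {
                // The parameterless overload also waits for the output handlers to finish.
                process.WaitForExit();
            }
            else
            {
                try { process.Kill(); }
                catch (InvalidOperationException) { /* already exited */ }
                catch (Win32Exception) { /* could not be terminated */ }
                return "Timed out at " + timeoutSeconds + " seconds waiting for " + exeName + " to exit.";
            }
        }
        lock (output) { return output.ToString(); }
    }
}
```

The call shape matches the original, for example SafeRunner.Run("Curl.exe", "-X PURGE v1.cms.servername.com", 10), though this version returns the output without the leading tab.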

Main Form1.cs

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using System.Net;

namespace MyCurlApp
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }

        private void Form1_Load(object sender, EventArgs e)
        {
            string[] myList = new string[3];
            myList[0] = "-X";
            myList[1] = "-X";
            myList[2] = "-X";
            comboBox1.Items.AddRange(myList);
            comboBox1.SelectedIndex = 0;

            string[] myList2 = new string[3];
            myList2[0] = "PURGE";
            myList2[1] = "PURGE";
            myList2[2] = "PURGE";
            comboBox2.Items.AddRange(myList2);
            comboBox2.SelectedIndex = 0;

        }

        private void button1_Click(object sender, EventArgs e)
        {
            try
            {

                string output;
                string arg1 = comboBox1.Text;
                string arg2 = comboBox2.Text;
                string arg3 = textBox1.Text;

                // My draconian error control. Maybe use a switch case if ever using another parameter
                // than -X. I only purge with this.

                // Test SelectedItem itself: calling ToString() on a null selection would throw
                if (textBox1.Text == "" || comboBox1.SelectedItem == null || comboBox2.SelectedItem == null)
                {
                    //MB works good if textBox is empty
                    MessageBox.Show("Select a valid parameter or URL!");
                    return;
                }
                else
                {
                    // run if all is cool
                    output = start.Run("Curl.exe", " " + arg1 + " " + arg2 + " " + textBox1.Text, 10);
                    MessageBox.Show(output + "If blank, check the URL");
                }

            }
            //Handle a missing selection first, then show any other exception
            //Should always include basic try and catch in case an error occurs
            catch (NullReferenceException ex)
            {
                MessageBox.Show("\nPerhaps you forgot to select something?\n" + ex.Message);
            }
            catch (Exception ex)
            {
                MessageBox.Show("Well this isn't good! " + "\r\n" + ex.Message);
            }

        }

        private void button2_Click(object sender, EventArgs e)
        {
            string getcb = Clipboard.GetText(TextDataFormat.UnicodeText);
            textBox1.Text = getcb;
        }

    }
}
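
The input check in button1_Click could also live in a small helper that builds the argument line and refuses anything that does not look like a web address. This is only a sketch with invented names (PurgeArgs, Build); it assumes a bare host name such as v1.cms.servername.com is meant as an http address, and it rejects input containing spaces or quotes so the text box cannot smuggle extra switches into the command line.

```csharp
using System;

public static class PurgeArgs
{
    // Hypothetical helper: returns the curl argument line, or null when the input is unusable.
    public static string Build(string method, string target)
    {
        if (string.IsNullOrWhiteSpace(method) || string.IsNullOrWhiteSpace(target))
            return null;

        string trimmed = target.Trim();
        foreach (char c in trimmed)
        {
            if (char.IsWhiteSpace(c) || c == '"')
                return null;                        // would break the quoted argument
        }

        // Assumption: no scheme means http
        string candidate = trimmed.Contains("://") ? trimmed : "http://" + trimmed;

        Uri uri;
        if (!Uri.TryCreate(candidate, UriKind.Absolute, out uri))
            return null;
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
            return null;

        return "-X " + method.Trim() + " \"" + trimmed + "\"";
    }
}
```

In the click handler, a null result would trigger the "Select a valid parameter or URL!" message, and anything else would be passed on as the second argument of start.Run.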

REFERENCES

Microsoft – How To Write a Wrapper for a Command-Line Tool with Visual C# .NET

DOWNLOAD

C# Programming out of need – Getting IP and MAC Addresses

[Image: getipmac-ip]

This is my C# project for getting IP and MAC addresses. Code kudos go out to the respective developers and websites, such as MSDN, Stack Overflow, C# Corner and others I have left out; all of the code is in the public domain and was modified by me to fit my needs.

As far as I am concerned my projects are as-is, and there is nothing efficient about the code in them, so please don’t beat me up too badly over any of it.

I am not a professional programmer in any regard but do consider myself a coder. This is just my slapped-together, get ‘er done tool befitting my personal need. If you want to comment on my project offline, send me an email at stevegossett (AT) outlook.com

NOTE
You will need .NET 4.5 installed for this C# project, and you will need to right-click the project name in the IDE and add references to:

  • System.Management
  • System.Management.Instrumentation
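
The project itself is only available through the download, so its code is not reproduced here. For readers who just want the general idea, the sketch below lists MAC and IPv4 addresses using System.Net.NetworkInformation, which is a different route from the System.Management references the project needs and requires no extra references. The class and method names (AdapterInfo, Describe) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Net.NetworkInformation;
using System.Net.Sockets;

public static class AdapterInfo
{
    // Hypothetical helper: one line per active adapter with its name, MAC and IPv4 addresses.
    public static List<string> Describe()
    {
        var lines = new List<string>();
        foreach (NetworkInterface nic in NetworkInterface.GetAllNetworkInterfaces())
        {
            if (nic.OperationalStatus != OperationalStatus.Up)
                continue;                           // skip disconnected adapters

            string mac = nic.GetPhysicalAddress().ToString();
            IEnumerable<string> ips = nic.GetIPProperties().UnicastAddresses
                .Where(a => a.Address.AddressFamily == AddressFamily.InterNetwork)
                .Select(a => a.Address.ToString());

            lines.Add(nic.Name + "  MAC=" + mac + "  IP=" + string.Join(", ", ips));
        }
        return lines;
    }
}
```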


Download Project