C# Project – Scraping Files for Data

I was receiving thousands of emails in Outlook about a specific network service error, many of them duplicates. The volume quickly surpassed my ability to copy, paste and manually process them, so I got fed up and looked for a more creative way to handle it.

At first I thought about creating a macro in Outlook to programmatically scrape the emails and save the information they contained, but macros are a security hazard and are locked down by a network policy anyway. Instead, I opted to make a utility in C#.

Creating a utility to access Office COM objects, namely Outlook, did nothing for me but try my patience. Then I thought, why not have the emails saved as text files to a folder as they come in? After I explained my intentions to the network admin, he set up a script on the mail server to save specific emails to a share as text files. Excellent.

So now, with the emails being saved as plain text files, I could create a text scraper to pull out the data I needed and process it. However, that data was sandwiched inside an error string that ran together with no spaces (ex: errorfound:D2A1AB9C83=|0X0C0MID:3:3), so I had to process the files by searching for a start string and grabbing the text between it and an end string. What I needed sat between the colon and the pipe character, and the text on either side of it was always the same. Below is the parser function and the button handler, where I also added code to save the list to a file.

[code language="csharp"]
//String parser: returns the text found between Start and End (empty string if not found)
public string ParseBetween(string Subject, string Start, string End)
{
    // Regex.Escape makes the marker strings safe to embed in a pattern
    string start = Regex.Escape(Start);
    string end = Regex.Escape(End);
    // Group 1 captures everything after Start up to End, without crossing another Start
    string pattern = start + @"\s*((?:(?!" + start + "|" + end + @").)+)\s*" + end;
    return Regex.Match(Subject, pattern, RegexOptions.IgnoreCase).Groups[1].Value;
}

//Parse between two strings and grab the contents as a new string
//Needs: using System.IO; using System.Text.RegularExpressions;
private void button1_Click(object sender, EventArgs e)
{
    textBox1.Clear();
    tsNotify.Text = "";
    string s2 = "errorfound:"; //beginning string
    string s3 = "|0X0C0MID:3:3"; //end string
    //files to parse
    foreach (string file in Directory.EnumerateFiles(@"\\server\SomeErrors\", "*.txt"))
    {
        string contents = File.ReadAllText(file);
        string strParsed = ParseBetween(contents, s2, s3);
        //Clean up the string: keep only letters, digits and spaces
        string clean = Regex.Replace(strParsed, "[^A-Za-z0-9 ]", "");
        textBox1.AppendText(clean + "\r\n");
    }
    //Save the list; the using block flushes and closes the writer
    using (StreamWriter objWriter = new StreamWriter(@"C:\ServerErrors.txt"))
    {
        objWriter.Write(textBox1.Text);
    }
    tsNotify.Text = "Parsed and saved to C:\\ServerErrors.txt";
}
[/code]
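
A regex is not strictly needed for fixed start and end markers. As a rough alternative, here is a minimal sketch that does the same "grab the text between two strings" job with `IndexOf`. The class name `MarkerParser` and the method `Between` are made up for illustration and are not part of the utility. It also behaves a little differently from the regex: it takes the first start marker it finds, and the text in between may span line breaks.

```csharp
using System;

public static class MarkerParser
{
    // Hypothetical helper: returns the text between the first occurrence of
    // start and the next occurrence of end, or an empty string if either is missing.
    public static string Between(string subject, string start, string end)
    {
        // Case-insensitive search for the start marker
        int from = subject.IndexOf(start, StringComparison.OrdinalIgnoreCase);
        if (from < 0) return string.Empty;
        from += start.Length;

        // Look for the end marker only after the start marker
        int to = subject.IndexOf(end, from, StringComparison.OrdinalIgnoreCase);
        if (to < 0) return string.Empty;

        return subject.Substring(from, to - from).Trim();
    }
}
```

Called with the sample line, `MarkerParser.Between("errorfound:D2A1AB9C83=|0X0C0MID:3:3", "errorfound:", "|0X0C0MID:3:3")` gives back `D2A1AB9C83=`, which the clean-up regex then reduces to `D2A1AB9C83`. An email that does not contain the markers gives back an empty string, which is where blank entries come from.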

As the data got pulled from the text files, I ran into some string anomalies such as blank lines (sometimes several in a row) and stray white space. Not all of the emails being saved under that title actually related to the error, so those emails were showing up as blank entries in the list. Trimming the strings and skipping the empty lines helped with that.

[code language="csharp"]
private void button5_Click(object sender, EventArgs e)
{
    string filePath = "C:\\ServerErrors.txt";
    //remove empty or whitespace-only lines and trim the rest (needs using System.Linq)
    string[] lines = File.ReadAllLines(filePath)
        .Where(s => !string.IsNullOrWhiteSpace(s))
        .Select(s => s.Trim())
        .ToArray();
    listBox1.Items.AddRange(lines);
    tsNotify.Text = "Errors added for processing";
}
[/code]
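
The blank-line filtering can also be pulled out of the button handler so it can be tried without the form. This is only a sketch: the class `LineCleaner` and the method `NonBlankTrimmed` are hypothetical names, not something from the original tool.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class LineCleaner
{
    // Hypothetical helper: drops empty and whitespace-only lines,
    // then trims leading and trailing white space from what is left.
    public static List<string> NonBlankTrimmed(IEnumerable<string> lines)
    {
        return lines
            .Where(line => !string.IsNullOrWhiteSpace(line))
            .Select(line => line.Trim())
            .ToList();
    }
}
```

For example, passing in `"  D2A1AB9C83 "`, `""`, `"   "` and `"FFEE01"` leaves just `D2A1AB9C83` and `FFEE01`.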

Once processed, I removed any duplicate data using the code below:

[code language="csharp"]
private void button6_Click(object sender, EventArgs e)
{
    //Clean first, then dedupe, so entries that only differ by stripped characters collapse
    string[] unique = listBox1.Items.Cast<object>()
        .Select(item => Regex.Replace(item.ToString(), "[^A-Za-z0-9 ]", ""))
        .Distinct()
        .ToArray();
    listBox1.Items.Clear();
    listBox1.Items.AddRange(unique);
    tsNotify.Text = "Duplicates removed";
}
[/code]
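
One ordering detail is worth illustrating: if entries are cleaned after the duplicates are removed, two entries that differ only by the stripped characters end up identical again. Below is a small sketch that cleans first and dedupes second. The names `ErrorDeduper` and `CleanThenDistinct` are invented for the example, and treating upper and lower case as the same error code is an assumption on my part, not something the emails guarantee.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class ErrorDeduper
{
    // Hypothetical helper: strips everything except letters, digits and spaces,
    // drops entries that end up empty, then removes duplicates.
    public static List<string> CleanThenDistinct(IEnumerable<string> entries)
    {
        return entries
            .Select(entry => Regex.Replace(entry, "[^A-Za-z0-9 ]", "").Trim())
            .Where(entry => entry.Length > 0)
            // Assumption: codes that differ only by letter case are the same error
            .Distinct(StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

Given `"D2A1AB9C83="`, `"D2A1AB9C83"`, `"d2a1ab9c83"` and `"FFEE01"`, this leaves two entries: `D2A1AB9C83` and `FFEE01`.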

Once that was finished, I had a final list of errors free of white space, blank lines and duplicates, which I then copied into a text box.

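[code language="csharp"]
private void button7_Click(object sender, EventArgs e)
{
    //Build the text once instead of re-assigning textBox2.Text for every item
    StringBuilder sb = new StringBuilder();
    foreach (object liItem in listBox1.Items)
        sb.Append(liItem).Append("\r\n");
    textBox2.AppendText(sb.ToString());
    tsNotify.Text = "Final list ready to copy";
}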
[/code]

After a few iterations I was successfully scraping and processing the data I needed. Since I only wanted to process newer text files each time, I deleted the files on the server share at the end of each run.

[code language="csharp"]
private void button2_Click(object sender, EventArgs e)
{
    //Delete every file on the share so the next run only sees new emails
    DirectoryInfo di = new DirectoryInfo(@"\\server\SomeErrors\");
    foreach (FileInfo file in di.GetFiles())
    {
        file.Delete();
    }
    //Verbatim string so both leading backslashes of the UNC path are shown
    tsNotify.Text = @"Remote files at \\server\SomeErrors deleted";
    textBox1.Text = "";
    listBox1.Items.Clear();
}
[/code]
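
A slightly more defensive take on the clean-up step might only touch the `.txt` files and skip any file that cannot be deleted, for instance one the mail server script is still writing. This is a sketch under those assumptions. `ShareCleaner` and `DeleteTextFiles` are hypothetical names, and the folder is passed in rather than hard-coded.

```csharp
using System;
using System.IO;

public static class ShareCleaner
{
    // Hypothetical helper: deletes the *.txt files directly inside folder
    // and returns how many were actually removed.
    public static int DeleteTextFiles(string folder)
    {
        int deleted = 0;
        // GetFiles returns the full list up front, so deleting inside the loop is safe
        foreach (string path in Directory.GetFiles(folder, "*.txt"))
        {
            try
            {
                File.Delete(path);
                deleted++;
            }
            catch (IOException)
            {
                // File is in use; leave it for the next run
            }
            catch (UnauthorizedAccessException)
            {
                // Read-only file or missing permission; leave it alone
            }
        }
        return deleted;
    }
}
```

The returned count could be shown in the status label, so it is obvious when some files were left behind.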

As with all of my projects, this one was born out of necessity and the code is not pretty in any way. It's functional for my needs and serves its purpose, though. The code in my project can be freely found around the internet by performing simple Google searches or hitting Microsoft’s programming help sites. If the code benefits anyone, then awesome. I take credit for nothing more than the tool I have created to accomplish a task.
